openpreserve / fido

Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.

Home Page: http://openpreservation.org/technology/products/fido/

License: Other

Python 98.66% Makefile 0.43% Dockerfile 0.91%

fido's Introduction

Format Identification for Digital Objects (fido)

By Open Preservation Foundation

FIDO is a command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.

FIDO uses the UK National Archives (TNA) PRONOM File Format and Container descriptions. PRONOM is available from http://www.nationalarchives.gov.uk/pronom/. See LICENSE for license information.

Usage

usage: fido [-h] [-v] [-q] [-recurse] [-zip] [-noextension] [-nocontainer]
            [-pronom_only] [-input INPUT] [-filename FILENAME]
            [-useformats INCLUDEPUIDS] [-nouseformats EXCLUDEPUIDS]
            [-matchprintf FORMATSTRING] [-nomatchprintf FORMATSTRING]
            [-bufsize BUFSIZE] [-sigs SIG_ACT]
            [-container_bufsize CONTAINER_BUFSIZE]
            [-loadformats XML1,...,XMLn] [-confdir CONFDIR]
            [FILE [FILE ...]]

positional arguments:

  • FILE: files to check. If the file is -, then read content from stdin. In this case, python must be invoked with -u or it may convert the line terminators.

optional arguments:

  • -h, --help: show this help message and exit
  • -v: show version information
  • -q: run (more) quietly
  • -recurse: recurse into subdirectories
  • -zip: recurse into zip and tar files
  • -nocontainer: disable deep scan of container documents, increases speed but may reduce accuracy with big files
  • -pronom_only: disables loading of format extensions file, only PRONOM signatures are loaded, may reduce accuracy of results
  • -input INPUT: file containing a list of files to check, one per line. - means stdin
  • -filename FILENAME: filename if file contents passed through STDIN
  • -useformats INCLUDEPUIDS: comma separated string of formats to use in identification
  • -nouseformats EXCLUDEPUIDS: comma separated string of formats not to use in identification
  • -matchprintf FORMATSTRING: format string (Python style) to use on match. See nomatchprintf, README.txt.
  • -nomatchprintf FORMATSTRING: format string (Python style) to use if no match. See README.txt
  • -bufsize BUFSIZE: size (in bytes) of the buffer to match against (default=131072 bytes)
  • -sigs SIG_ACT: SIG_ACT "check" for new version of signature file for download. SIG_ACT "list" list all available sig file versions. SIG_ACT "update" to automatically update to latest available sig file. SIG_ACT "n" download and use version n.
  • -container_bufsize CONTAINER_BUFSIZE: size (in bytes) of the buffer to match against (default=524288 bytes)
  • -loadformats XML1,...,XMLn: comma separated string of XML format files to add.
  • -confdir CONFDIR: configuration directory to load_fido_xml, for example, the format specifications from.

Installation

(also see: http://wiki.opf-labs.org/display/KB/FIDO+usage+guide)

Any platform

  1. Download the latest zip release from https://github.com/openpreserve/fido/releases
  2. Unzip into some directory
  3. Open a command shell, cd to the directory that you placed the zip contents into
  4. Run python setup.py install to install FIDO and dependencies. This may require sudo on Linux/OSX or admin privileges on Windows.
  5. You should now be able to see the help text: fido -h

Using pip

  1. Run pip install opf-fido. This may require sudo on Linux/OSX or admin privileges on Windows.
  2. You should now be able to see the help text: fido -h

Updating signatures

Signatures can be updated from the OPF's signature service. The service is pull only and its location is set in the versions.xml configuration file as

<updateSite>https://fidosigs.openpreservation.org</updateSite>

To check what version of the PRONOM signatures you are using type: fido -v and you'll see something like:

FIDO v1.6.0 (pronom-xml-95.zip, container-signature-20200121.xml, format_extensions.xml)

Here pronom-xml-95.zip denotes PRONOM version 95. To see if a more recent set of signatures is available type fido -sigs check which will report back:

Updated signatures v104 are available, current version is v95

if new signatures are available or

Your signature files are up to date, current version is v104

if not. To update signatures to the latest version type fido -sigs update:

Updated signatures v104 are available, current version is v95
Updating signatures
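
For scripts that need to record which signatures were in use, the banner printed by fido -v can be parsed. The following is a minimal sketch, not part of FIDO: the function name parse_banner and the regular expressions are assumptions based only on the banner formats shown on this page (pronom-xml-NN.zip in newer releases, formats-vNN.xml in older ones).

```python
import re

# Patterns inferred from the example banners on this page (an assumption).
BANNER = re.compile(r"FIDO v(?P<fido>[\d.]+) \((?P<files>[^)]*)\)")
SIGFILE = re.compile(r"(?:pronom-xml-|formats-v)(\d+)\.(?:zip|xml)")

def parse_banner(line):
    """Extract the FIDO version and PRONOM signature version from a banner line."""
    banner = BANNER.search(line)
    if banner is None:
        return None
    files = [name.strip() for name in banner.group("files").split(",")]
    pronom = None
    for name in files:
        found = SIGFILE.fullmatch(name)
        if found:
            pronom = int(found.group(1))
    return {"fido": banner.group("fido"), "pronom": pronom, "files": files}

info = parse_banner(
    "FIDO v1.6.0 (pronom-xml-95.zip, container-signature-20200121.xml, "
    "format_extensions.xml)"
)
print(info["fido"], info["pronom"])
```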

If you are having trouble due to firewall restrictions, see OPF wiki: http://wiki.opf-labs.org/display/PT/Command+Line+Interface+proxy+usage

Please note that this WILL NOT update the container signature file located in the 'conf' folder. The reason for this is that the PRONOM container signature file contains special types of sequences which need to be tested before FIDO can use them. If an update is available for the PRONOM container signature file, it will show up in a subsequent commit.

Dependencies

FIDO 1.0 through 1.3.3 will run on Python 2.7 with no other dependencies.

FIDO 1.3.4 and later require the Python package 'olefile'. It can be installed with pip install olefile or by running python setup.py install; installing FIDO with pip handles the dependency automatically.

FIDO 1.3.3 and later have experimental Python 3 support.

FIDO 1.4 and later have Python 3 support.

Format Definitions

By default, FIDO loads format information from two files, conf/formats.xml and conf/format_extensions.xml. Additional format files can be specified using the -loadformats command line argument. They should use the same syntax as conf/format_extensions.xml. If more than one format file needs to be specified, they should be comma separated, as with the -useformats argument.

Output

Output is controlled with the two parameters matchprintf and nomatchprintf. Each is a string that may contain formatting information. They have access to an object called info with the following fields:

  • printmatch: info.version (file format version X), info.alias (format also called X), info.apple_uti (Apple Uniform Type Identifier), info.group_size and info.group_index (if a file has multiple (tentative) hits), info.count (file N)

  • printnomatch: info.count (file N)

The defaults for FIDO 1.0 are:

  • printmatch:

  • "OK,%(info.time)s,%(info.puid)s,%(info.formatname)s,%(info.signaturename)s,%(info.filesize)s,\"%(info.filename)s\",\"%(info.mimetype)s\",\"%(info.matchtype)s\"\n"

  • printnomatch:

  • "KO,%(info.time)s,,,,%(info.filesize)s,\"%(info.filename)s\",,\"%(info.matchtype)s\"\n"

It can be useful to provide an empty string for either, for example to ignore all failed matches, or all successful ones (see examples below). Note that a newline needs to be added to the end of the string using \n.
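
Because the default strings produce comma-separated lines with quoted fields, the output can be read back with Python's csv module. This is a minimal sketch under stated assumptions: the column names in FIELDS are labels chosen here to mirror the order of the default printmatch string, parse_fido_output is a hypothetical helper, and the sample lines are illustrative.

```python
import csv
import io

# Labels chosen to follow the field order of the default printmatch string.
FIELDS = ["status", "time", "puid", "formatname", "signaturename",
          "filesize", "filename", "mimetype", "matchtype"]

def parse_fido_output(text):
    """Turn FIDO's default CSV output into a list of dicts, one per result line."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # Skip anything that is not a result line, such as the version
        # banner, error messages, or blank lines.
        if len(record) != len(FIELDS) or record[0] not in ("OK", "KO"):
            continue
        rows.append(dict(zip(FIELDS, record)))
    return rows

sample = (
    'OK,168,x-fmt/263,"ZIP Format","ZIP format",294895,'
    '"Personal_Files_Folder.zip","application/zip","signature"\n'
    'KO,23,,,,220672,"Personal_Files_Folder.zip!Personal_Files_Folder/NZPNPERS.NDX",,"fail"\n'
)
for row in parse_fido_output(sample):
    print(row["status"], row["puid"] or "-", row["filename"])
```

If a custom matchprintf string is used, FIELDS would need to be changed to match it.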

Matchtypes

FIDO returns the following matchtypes:

  • fail: the object could not be identified with signature or file extension
  • extension: the object could only be identified by file extension
  • signature: the object has been identified with (a) PRONOM signature(s)
  • container: the object has been identified with (a) PRONOM container signature(s)

In some cases multiple results are returned.

Examples running FIDO

Identify all files in the current directory and below, sending output into file-info.csv: python fido.py -recurse . > file-info.csv

Do the same as above, but also look inside of zip or tar files: python fido.py -recurse -zip . > file-info.csv

Take input from a list of files:

Linux:

ls > files.txt
python fido.py -input files.txt

Windows:

dir /b > files.txt
python fido.py -input files.txt

Take input from a pipe:

Linux: find . -type f | python fido.py -input -

Windows: dir /b | python fido.py -input -

Only show files that could not be identified: python fido.py -matchprintf "" .

Only show files that could be identified: python fido.py -nomatchprintf "" .
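
The same invocations can be driven from a Python script. The sketch below assumes FIDO was installed so that a fido executable is on the PATH (as described under Installation); fido_command and run_fido are hypothetical helper names, and only the -recurse and -zip flags documented above are used.

```python
import subprocess

def fido_command(paths, recurse=False, look_in_archives=False, executable="fido"):
    """Build the argument list for a FIDO run from the documented flags."""
    command = [executable]
    if recurse:
        command.append("-recurse")
    if look_in_archives:
        command.append("-zip")
    return command + list(paths)

def run_fido(paths, **options):
    """Run FIDO and return its standard output as text (requires FIDO installed)."""
    completed = subprocess.run(fido_command(paths, **options),
                               capture_output=True, text=True, check=True)
    return completed.stdout

print(fido_command(["."], recurse=True, look_in_archives=True))
```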

Deep scan of container objects

By default, when FIDO detects that a file is a container (compound) object, it will start a deep (complete) scan of the file using the PRONOM container signatures. When identifying big files, this behaviour can cause FIDO to slow down significantly. You can disable deep scanning by invoking FIDO with the -nocontainer argument. While disabling deep scan speeds up identification, it may reduce accuracy.

At the moment (version 1.0) FIDO is not yet able to scan containers which are passed through STDIN. A workaround is to save the stream to a temporary file and have FIDO identify that file.
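
The temporary-file workaround could look roughly like this. It is a sketch, not FIDO code: spool_to_tempfile is a hypothetical helper, and the in-memory stream stands in for sys.stdin.buffer.

```python
import io
import os
import shutil
import tempfile

def spool_to_tempfile(stream, suffix=""):
    """Copy a binary stream to a named temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(stream, tmp)
        return tmp.name

# Demo with an in-memory stream; a real script would pass sys.stdin.buffer.
demo_path = spool_to_tempfile(io.BytesIO(b"PK\x03\x04 example bytes"), suffix=".zip")
with open(demo_path, "rb") as handle:
    demo_bytes = handle.read()
os.remove(demo_path)  # the caller is responsible for cleaning up
```

The returned path can then be given to FIDO as an ordinary FILE argument, and the file removed once identification has finished.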

License information

See the file "LICENSE.txt" for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES...

fido's People

Contributors

ablwr, adamfarquhar, anjackson, carlwilson, edsu, georgiamoppett, gphemsley, hwesta, jhsimpson, jrwdunham, makije, mbhopton, mistydemeo, numeroushats, oskarpersson, sevein, techmaurice, worr

fido's Issues

Add multi-threading / multi-processing

The 0.5.x implementation appears to be IO bound. Throughput would be increased by moving file-reads to a separate thread so that they will happen in parallel with pattern matching.

One approach: add multiple workers, each of which reads, matches. Another: add a pool to do reads, and another to do matches.

But - it's all fast enough for now!
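
The first approach (several workers that each read and match) can be sketched with the standard library's thread pool. Everything here is illustrative: SIGNATURES is a two-pattern stand-in for FIDO's real signature set, and identify and identify_all are hypothetical names. Threads mainly help with overlapping file reads; the regular-expression matching itself is still likely to be serialised by the interpreter lock.

```python
import os
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for FIDO's signature set: two beginning-of-file patterns.
SIGNATURES = {
    "pdf": re.compile(rb"\A%PDF-"),
    "zip": re.compile(rb"\APK\x03\x04"),
}

def identify(path, bufsize=131072):
    """Read the start of one file and match it; runs inside a worker thread."""
    with open(path, "rb") as handle:
        head = handle.read(bufsize)
    hits = [name for name, pattern in SIGNATURES.items() if pattern.search(head)]
    return path, hits

def identify_all(paths, workers=4):
    """Several workers, each of which reads and then matches one file at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(identify, paths))

# Small demo on a throwaway file.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "a.pdf")
    with open(sample_path, "wb") as handle:
        handle.write(b"%PDF-1.4\n")
    results = identify_all([sample_path])
print(results[0][1])
```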

uncompressed epubs

Currently epubs whose files are stored in the container uncompressed are recognized by fido as either format fmt/483 or fmt/103.

IMHO fmt/483 ('ePub format') should have precedence over fmt/103 ('Extensible Hypertext Markup Language'). Maybe this is a PRONOM issue, but I was bitten by the fact that fido lists the two formats in different order on different machines. On the test machines 483 is consistently listed first, but in production the 103 format was first, causing epub detection to fail, as we assumed fido would list the best match first.

If I'm not mistaken fido currently does not use the file extension in determining the format, but I'm advocating that file extension should be used to determine the 'best' format match in case of multiple hits. It would make fido more reliable in cases like ours, where a background process cannot pause and wait for an operator to make the proper choice.
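
A caller can apply this kind of tie-break on its own side in the meantime. The sketch below is not FIDO behaviour: EXTENSIONS is a hypothetical table (real extension data lives in FIDO's format files) and order_matches is an illustrative helper that moves matches agreeing with the file extension to the front.

```python
import os

# Hypothetical extension table, for illustration only.
EXTENSIONS = {
    "fmt/483": {"epub"},
    "fmt/103": {"xhtml", "xht"},
}

def order_matches(filename, puids):
    """Put matches whose known extensions include the file's extension first."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    # sorted() is stable, so ties keep the order in which they were reported.
    return sorted(puids, key=lambda puid: ext not in EXTENSIONS.get(puid, set()))

print(order_matches("book.epub", ["fmt/103", "fmt/483"]))
```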

Fido "Can't convert 'bytes' object to str implicitly" in Python 3.4/3.5

Hi,
As a new user of Fido, I ran into the error message from the subject while analysing Fido's README.txt on:

  • Windows 8.1 Enterprise (64 bits)
  • Python 3.4
  • Fido 1.3.4

The problem seems related to differences between Python 2 and 3 (Unicode handling). If I use Python 2.7.11, Fido works just fine ("Plain Text File"). If I use Python 3.5.1, I get the same error message. See below for more details.

c:\fido>fido
usage: fido-script.py [-h] [-v] [-q] [-recurse] [-zip] [-nocontainer]
[-pronom_only] [-input INPUT] [-filename FILENAME]
[-useformats INCLUDEPUIDS] [-nouseformats EXCLUDEPUIDS]
[-matchprintf FORMATSTRING]
[-nomatchprintf FORMATSTRING] [-bufsize BUFSIZE]
[-container_bufsize CONTAINER_BUFSIZE]
[-loadformats XML1,...,XMLn] [-confdir CONFDIR]
[FILE [FILE ...]]
(etc. - Fido seems to have been installed properly)

c:\fido>fido README.txt
FIDO v1.3.4 (formats-v84.xml, container-signature-20160121.xml, format_extensions.xml)
Traceback (most recent call last):
File "C:\Python34\Scripts\fido-script.py", line 9, in <module>
load_entry_point('opf-fido==1.3.4', 'console_scripts', 'fido')()
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 869, in main
fido.identify_file(file)
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 375, in identify_file
bofbuffer, eofbuffer, _ = self.get_buffers(f, size, seekable=True)
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 543, in get_buffers
bofbuffer = self.blocking_read(stream, bytes_to_read)
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 527, in blocking_read
buffer += readbuffer
TypeError: Can't convert 'bytes' object to str implicitly

c:\fido>
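
The traceback points at a buffer being extended with the result of a read. On Python 3 a file opened in binary mode returns bytes, so the accumulator has to start as b"" rather than "". The function below is an independent sketch that borrows the name blocking_read from the traceback; it is not FIDO's actual implementation.

```python
import io

def blocking_read(stream, bytes_to_read):
    """Read up to bytes_to_read bytes, looping until the stream is exhausted."""
    buffer = b""  # bytes, not "": str + bytes raises TypeError on Python 3
    while len(buffer) < bytes_to_read:
        chunk = stream.read(bytes_to_read - len(buffer))
        if not chunk:
            break
        buffer += chunk
    return buffer

head = blocking_read(io.BytesIO(b"Plain text file contents"), 10)
print(head)
```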

Determining file formats within a ZIP file gives an [Error 2] but then correctly determines the format

fido.py Personal_Files_Folder.zip yields:

OK,168,x-fmt/263,"ZIP Format","ZIP format",294895,"Personal_Files_Folder.zip","application/zip","signature"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",3919,"Personal_Files_Folder.zip!Personal_Files_Folder/BIBORGAN.SC","None","signature"
[Errno 2] No such file or directory: 'Personal_Files_Folder.zip!Personal_Files_Folder/CH7.RD'
OK,11,fmt/111,"OLE2 Compound Document Format","OLE2 Compound Document Format",149504,"Personal_Files_Folder.zip!Personal_Files_Folder/CH7.RD","None","signature"
OK,6,fmt/393,"Borland Reflex flat datafile","Borland Reflex flat datafile",10808,"Personal_Files_Folder.zip!Personal_Files_Folder/COURTNE.RXD","None","signature"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",7400,"Personal_Files_Folder.zip!Personal_Files_Folder/DELIVERY","None","signature"
OK,20,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",156689,"Personal_Files_Folder.zip!Personal_Files_Folder/DIRMIN2","None","signature"
OK,10,x-fmt/22,"7-bit ASCII Text","External",30464,"Personal_Files_Folder.zip!Personal_Files_Folder/INDEX.ASC","text/plain","extension"
OK,10,x-fmt/283,"8-bit ASCII Text","External",30464,"Personal_Files_Folder.zip!Personal_Files_Folder/INDEX.ASC","text/plain","extension"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",9915,"Personal_Files_Folder.zip!Personal_Files_Folder/MODULE1.RH","None","signature"
OK,8,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",27580,"Personal_Files_Folder.zip!Personal_Files_Folder/NZ&AUST.WP","None","signature"
OK,26,x-fmt/8,"dBASE Database","dBase Table Version II (date last updated (month (1-12), day (1-31), year)",819200,"Personal_Files_Folder.zip!Personal_Files_Folder/NZPN.DBF","None","signature"
KO,23,,,,220672,"Personal_Files_Folder.zip!Personal_Files_Folder/NZPNPERS.NDX",,"fail"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",3062,"Personal_Files_Folder.zip!Personal_Files_Folder/PTCHALMI.WP","None","signature"
OK,12,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",59816,"Personal_Files_Folder.zip!Personal_Files_Folder/SEMINAR.DOC","None","signature"
OK,9,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",30937,"Personal_Files_Folder.zip!Personal_Files_Folder/SESSION2","None","signature"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",7494,"Personal_Files_Folder.zip!Personal_Files_Folder/TOWNNUMB","None","signature"
OK,17,fmt/125,"Microsoft Powerpoint Presentation","Powerpoint 95",76800,"Personal_Files_Folder.zip!Personal_Files_Folder/Week7.ppt","application/vnd.ms-powerpoint","signature"

The second file within the zip is CH7.RD which fido claims is not found, but then it successfully determines the format. Also, running fido.py on the unzipped files from this .zip works fine.

Maurice Bouchard commented:

The command I issued should read:
fido.py -zip Personal_Files_Folder.zip

sorry for the confusion.

Maurice de Rooij commented:

Thank you very much for reporting.

This issue will be fixed in the next commit.

The second file within the zip is CH7.RD which fido claims is not found, but then it successfully determines the format. Also, running fido.py on the unzipped files from this .zip works fine.

The read error is due to the fact that the function which analyzes container files is not yet able to recurse into zip files.
The successful determination of the format afterwards happens because that result is the one that originally triggered the container function.

Original issue: FIDO-28

Signature file improvements?

Some ideas for improving how signature files are handled

One idea is to use Roy to create signature files. This would allow access to a wider range of formats since it can include information from Apache Tika, freedesktop.org MIME-info and Library of Congress FDDs.

Another is to use DROID's signature files directly, since they're what PRONOM offers by default, and not have to perform a transformation. This may be a simpler change than above.

Other suggestions welcome!

Fix fetching example URLs when updating signature

When updating signatures, if the format has a ReferenceFileIdentifier of type URL, we include a reference to it, including fetching it and calculating a checksum. However, ReferenceFileIdentifier is not consistent in its meaning or format.

E.g., in PRONOM 88 the identifiers for fmt/11 start with www, and the URLs point directly at PNG files:

<ReferenceFileIdentifier>
  <Identifier>www.w3.org/Graphics/PNG/nurbcup2si.png</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>
...
<ReferenceFileIdentifier>
  <Identifier>www.w3.org/Graphics/PNG/666.png</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>

compared to fmt/569, where the identifier starts with http:// and is an HTML page linking to examples:

<ReferenceFileIdentifier>
  <Identifier>http://www.matroska.org/downloads/test_w1.html</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>

When parsing it, we always prepend http:// and fetch the result, which breaks for identifiers that already include a scheme, such as http://www.matroska.org/downloads/test_w1.html:

url = "http://" + get_text_tna(id, 'Identifier')
...
sock = urlopen(url)

Options include removing the examples and checksums from formats-v##.xml, or adding error handling around that section.
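
The error-handling option might look something like the following sketch. The helper names normalise_reference_url and fetch_reference are hypothetical and do not exist in FIDO; the idea is simply to prepend http:// only when no scheme is present and to treat a failed fetch as "no example available" rather than letting it abort the update.

```python
from urllib.error import URLError
from urllib.parse import urlsplit
from urllib.request import urlopen

def normalise_reference_url(identifier):
    """Prepend http:// only when the identifier has no http(s) scheme yet."""
    identifier = identifier.strip()
    if urlsplit(identifier).scheme in ("http", "https"):
        return identifier
    return "http://" + identifier

def fetch_reference(identifier, timeout=10):
    """Return the referenced bytes, or None if the URL cannot be fetched."""
    url = normalise_reference_url(identifier)
    try:
        with urlopen(url, timeout=timeout) as sock:
            return sock.read()
    except (URLError, ValueError, OSError):
        # Unreachable hosts, malformed URLs and timeouts all end up here.
        return None

print(normalise_reference_url("www.w3.org/Graphics/PNG/666.png"))
print(normalise_reference_url("http://www.matroska.org/downloads/test_w1.html"))
```

This does not address the other inconsistency noted above, namely that some identifiers point at an HTML page rather than at a sample file.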

File extension identification should use the most generic match

I came across an edge case of fido's file extension identification when trying to look at some invalid XML. It looks like fido is following normal signature precedence rules, even when that doesn't make as much sense for non-signature identification.

For example, look at this file: https://gist.github.com/mistydemeo/8967705/raw/f273b8df4ee2998776fafcd6b1e99b94549181a8/pointer.xml

There's no signature match, since it's missing an XML declaration. When fido falls back to using file extension, though, it has this curious result:

FIDO v1.3.1 (formats-v70.xml, container-signature-20130501.xml, format_extensions.xml)
OK,130,fmt/121,"DROID Signature File Format","External",2779,"pointer.xml","text/xml","extension"
OK,130,fmt/120,"DROID File Collection File Format","External",2779,"pointer.xml","text/xml","extension"
FIDO: Processed      1 files in 170.00 msec,  6 files/sec

It's not exactly a DROID signature file! Turns out that fmt/121 declares precedence over fmt/101 (XML), and fido duly follows that when identifying by extension.

Given that extension matching is spotty, and only happens when specific matches haven't occurred, I think fido's behaviour should be the opposite here - it should match the most general format instead of following precedence to the most specific.
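
The difference between the two rules can be sketched as follows. The priority table is hypothetical: only the fmt/121 over fmt/101 relation is stated above, and the fmt/120 entry is an assumption added for illustration. Neither function is FIDO code.

```python
# Hypothetical priority table: each key has priority over the PUIDs in its value.
# Only the fmt/121 entry is taken from the report; fmt/120 is assumed.
HAS_PRIORITY_OVER = {
    "fmt/121": {"fmt/101"},
    "fmt/120": {"fmt/101"},
}

def most_specific(candidates):
    """Signature-style rule: drop any format that another candidate overrides."""
    overridden = set()
    for puid in candidates:
        overridden |= HAS_PRIORITY_OVER.get(puid, set())
    return [puid for puid in candidates if puid not in overridden]

def most_generic(candidates):
    """Proposed rule for extension matches: drop candidates that override another."""
    present = set(candidates)
    return [puid for puid in candidates
            if not (HAS_PRIORITY_OVER.get(puid, set()) & present)]

xml_candidates = ["fmt/101", "fmt/120", "fmt/121"]
print(most_specific(xml_candidates))
print(most_generic(xml_candidates))
```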

MKV File format

Hi,

I am trying to check whether an MKV file really is an MKV file (video/x-matroska mimetype), using FIDO and formats-v81.xml. I use this command:

$ python fido.py -matchprintf "%(info.mimetype)s\n" "my-file.mkv"

But it always returns "None". Can you help me, please?

Thanks

invalid SRE code

Dev Effort

0.5D

Description

I don't know why, but for the past few days fido has been failing to identify docx files, reporting them as zip instead, and the message 'invalid SRE code' appears at the start of the output. A previous issue suggests that 'invalid SRE code' is a bug in Python v2.7.3, so I tested with Python v2.7.6: the 'invalid SRE code' message is gone, but docx files are still not recognized:

FIDO v1.3.1 (formats-v78.xml, container-signature-20130501.xml, format_extensions.xml)
invalid SRE code
OK,9,x-fmt/263,"ZIP Format","ZIP format",3734,"/home/fajir/test.docx","application/zip","signature"

This happens although the first entry in my conf/format_extensions.xml (customized to use the PRONOM PUID in place of the fido PUID) has priority over x-fmt/263:

<format>
<puid>fmt/412</puid>
<name>Microsoft Office Open XML - Word</name>
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime>
<extension>docx</extension>
<has_priority_over>x-fmt/263</has_priority_over>
<has_priority_over>fmt/189</has_priority_over>
<signature>
<name>Microsoft Office Open XML - Word</name>
<pattern><position>BOF</position><regex>(?s)\APK\x03\x04</regex></pattern>
<pattern><position>BOF</position><regex>(?s)\A.{30}\[Content_Types\]\.xml \xa2</regex></pattern>
<pattern><position>EOF</position><regex>(?s)\x00\x00word/.{1,20}\.xmlPK\x01\x02\x2d.{0,2000}\Z</regex></pattern>
</signature>
</format>

Any help ?

Automated importing of PRONOM signatures and file extensions

Currently there is no way to automatically convert PRONOM/DROID signatures to Fido-compatible format. In an operational setting this would be a pretty severe limitation, and it makes managing the signature information quite difficult. Also, it would be helpful to use some kind of versioning scheme for the 'formats' and 'format_extensions' files, and to record some information on the provenance of the information in these files (e.g. "DROID" + sig file number).

Andrew Jackson added a comment - 19/Sep/11 3:23 PM
I don't quite understand this issue, as Fido does contain code to turn PRONOM Format Records into Fido signatures. Does this issue refer to the ability to re-use the pre-compiled signatures in the DROID signature file? That might be possible, but will certainly be rather ugly. It might be easier to download the corresponding Format Record instead of using the DROID sig file directly.

Maurice de Rooij added a comment - 19/Sep/11 4:26 PM
There is a script called 'prepare.py' which converts the pronom-xml.zip in 'conf'.
I have just fixed a minor bug which caused the script to crash when it encountered a certain byte while saving the formats.xml file. The current script to fetch the Format Records is not very cross-platform friendly (a bash script), and I am currently extending 'prepare.py' to fetch AND convert the Format Records on the fly.

Original issue: FIDO-6

Refactor code to Python 3.x

Refactoring to Python 3.x involves the following tasks:

* Can't convert 'bytes' object to str implicitly in def identify_file
* urlparse is now moved to urllib, but import fails
* builtin object problems

Original issue: FIDO-17

Add seek to the zip file-like-object

The zipitem file-like-object supports read(n_bytes), but does not support seek(). When the item is compressed, then seek will have to scan through from the start - inefficient, but it would eliminate the need for special handling.
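
The "scan through from the start" idea can be sketched as a small wrapper that reopens the item whenever a backwards seek is requested. RescanningReader is a hypothetical class written for this illustration, and it only supports absolute offsets. Recent Python versions (3.7 and later, as far as I know) already provide seek() on the objects returned by ZipFile.open, so this mainly shows the mechanism.

```python
import io
import zipfile

class RescanningReader:
    """Read-only wrapper that emulates absolute seek() by re-reading from the start."""

    def __init__(self, opener):
        self._opener = opener      # callable returning a fresh stream at offset 0
        self._stream = opener()
        self._pos = 0

    def read(self, size=-1):
        data = self._stream.read(size)
        self._pos += len(data)
        return data

    def seek(self, offset):
        if offset < self._pos:     # going backwards: start over
            self._stream.close()
            self._stream = self._opener()
            self._pos = 0
        while self._pos < offset:  # skip forward by reading and discarding
            chunk = self.read(min(offset - self._pos, 65536))
            if not chunk:          # end of stream reached before offset
                break
        return self._pos

# Demo on an in-memory archive with one compressed item.
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w", zipfile.ZIP_DEFLATED) as writer:
    writer.writestr("item.txt", b"0123456789")
archive = zipfile.ZipFile(archive_bytes)
reader = RescanningReader(lambda: archive.open("item.txt"))
reader.seek(6)
tail = reader.read(2)
reader.seek(1)                     # backwards: reopens, then skips one byte
head = reader.read(3)
```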

Add format groups

Add format groups so that it is easier to use the -formats or -excludeformats arguments. For example, if all of the PDF formats were placed into a group, then
fido.run -formats pdf -r .
would identify only the PDF documents in the directory tree (and -excludeformats pdf would identify only the non-PDF ones).

Support local extensions to the format library

Provide a method to extend the set of signatures. Perhaps a file which holds the basic information. Once mature, the new signatures could be added to Pronom or into the Pronom XML syntax.

Assertion error while updating signature file

I just installed Fido 1.0 (Win XP), after which I tried to update the signature file (latest version is v59). At the end of the updating procedure, while Fido is trying to convert the PRONOM signatures to Fido's format, an assertion error occurs. Below is a screen dump of the updating procedure:

C:\fido>c:\python27\python .\update_signatures.py
FIDO signature updater v1.0
Contacting PRONOM...
Querying latest signaturefile version...
Downloading signature file version 59...
Extracting PRONOM PUID's from signature file...
Found 864 PRONOM PUID's
Downloading signatures can take a while
Continue and download signatures? (yes/no): y
Creating temporary folder for download: C:\fido\conf\tmp
Downloading signatures, one moment please...
  100%
Creating PRONOM zip, adding files with compression mode 'deflated'
Deleting temporary folder and files...
Preparing to convert PRONOM formats to FIDO signatures...
Conversion: Illegal character in bracket: char='0', at pos 31 in
  52494646{4}57415645*666D7420[12000000:FFFFFF7F][!FEFF]{16-*}64617461
                                 ^
Buffer = (?s)\ARIFF.{4}WAVE.*fmt [\x12
Traceback (most recent call last):
  File ".\update_signatures.py", line 131, in <module>
    main()
  File ".\update_signatures.py", line 126, in main
    prepare.main()
  File "C:\fido\prepare.py", line 574, in main
    info.load_pronom_xml(args.puid)
  File "C:\fido\prepare.py", line 105, in load_pronom_xml
    format = self.parse_pronom_xml(stream, puid_filter)
  File "C:\fido\prepare.py", line 185, in parse_pronom_xml
    regex = convert_to_regex(bytes, 'Little', pos, offset, max_offset)
  File "C:\fido\prepare.py", line 456, in convert_to_regex
    assert(chars[i] == ':')
AssertionError

epub recognized as xls

tried with several epub files, same behaviour

$ ./fido.py ~/Downloads/Zizek\ -\ Vivere\ alla\ fine\ dei\ tempi.epub

FIDO v1.1.2 (formats-v66.xml, container-signature-20121218.xml, format_extensions.xml)
OK,295,x-fmt/263,"ZIP Format","ZIP format",742241,"/Users/void/Downloads/Zizek - Vivere alla fine dei tempi [Ladri di biblioteche].epub","application/zip","container"
OK,295,fmt/61,"Microsoft Excel 97 Workbook (xls)","BIFF 8 & 8X Workbook (generic)",742241,"/Users/raffaele/Downloads/Zizek - Vivere alla fine dei tempi.epub","application/vnd.ms-excel","container"
FIDO: Processed      1 files in 386.89 msec,  3 files/sec

Some pdf and htm files are not recognised

With the v40 signatures, some files are not being correctly identified. These include some pdf's, doc's, htm's, and mov's. Need to (1) check if the behaviour has changed from v39; (2) check the signatures; (3) patch-up any missing signatures.

Match each pattern only once per file

Many signatures re-use patterns. For example, the PDFs all have the same end-of-file pattern. The Zip family (jar, zip, odf, ooxml, ...) all share some patterns. It would be easy to check these once per file.
A better approach might be to change the signature approach so that these tests are moved up to a super-type and only stored once in Pronom. This would help to avoid inconsistencies between signatures.
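
The first idea (checking a shared pattern once per file) amounts to a per-file cache keyed by pattern. The sketch below uses two made-up PDF signatures that share an end-of-file pattern; the names SIGNATURES and identify, and the patterns themselves, are illustrative assumptions rather than FIDO's data.

```python
import re

# Hypothetical signatures: every pattern in the list must match.
PDF_EOF = rb"%%EOF\s*\Z"
SIGNATURES = {
    "pdf-1.4": [PDF_EOF, rb"\A%PDF-1\.4"],
    "pdf-1.5": [PDF_EOF, rb"\A%PDF-1\.5"],
}

def identify(data):
    """Return (matching signature names, number of regex searches performed)."""
    cache = {}     # pattern -> bool, valid for this one file only
    searches = 0

    def matches(pattern):
        nonlocal searches
        if pattern not in cache:
            searches += 1
            cache[pattern] = re.search(pattern, data) is not None
        return cache[pattern]

    hits = [name for name, patterns in SIGNATURES.items()
            if all(matches(pattern) for pattern in patterns)]
    return hits, searches

hits, searches = identify(b"%PDF-1.5\nbody\n%%EOF\n")
print(hits, searches)
```

Here the shared end-of-file pattern is searched once instead of twice; with many signatures in a family the saving would grow accordingly.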

Suggestion for improved identification of XML using XMP parser

Identification of XML goes wrong if files don't contain an XML declaration (which is not required by the XML spec). This is not a Fido bug, but simply a limitation of signature-based identification.

Possible solution: check for XML well-formedness using Python's Expat parser; possibly add this as a user-activated option. More details + sample code here:

http://www.openplanetsfoundation.org/blogs/2011-07-11-improved-identification-xml-python-experiment

Original issue: FIDO-14
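
A well-formedness check with the Expat bindings in Python's standard library can be very short. This is a minimal sketch of the idea, not the code from the linked blog post; is_well_formed_xml is a hypothetical helper name.

```python
from xml.parsers import expat

def is_well_formed_xml(data):
    """Return True if data (bytes or str) parses as well-formed XML."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True

print(is_well_formed_xml(b"<pointer><target/></pointer>"))
print(is_well_formed_xml(b"<pointer><target></pointer>"))
```

Parsing a whole file is more expensive than matching a signature against a buffer, which is one reason to keep such a check optional.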

bogus escape: '\\x' on word documents identification (container-signature updated : container-signature-20140923.xml)

I tried updating the conf/container-signature file (to container-signature-20140923.xml) to see the difference in Word document identification, and the result is bad: docx files are no longer recognized (they are identified as zip). See:

with container-signature-20140923.xml :

FIDO v1.3.1 (formats-v78.xml, container-signature-20140923.xml, format_extensions.xml)
OK,240,fmt/40,"Microsoft Word for Windows Document","Microsoft Word for Windows 97 - 2002",11776,"/home/fajir/docs/AnnexeDoc.doc","application/msword","signature"
bogus escape: '\\x'
OK,10,x-fmt/263,"ZIP Format","ZIP format",25204,"/home/fajir/docs/cours sur les theories de la motivation.docx","application/zip","signature"

with container-signature-20130501.xml :

FIDO v1.3.1 (formats-v78.xml, container-signature-20130501.xml, format_extensions.xml)
OK,230,fmt/40,"Microsoft Word for Windows Document","Microsoft Word for Windows 97 - 2002",11776,"/home/fajir/docs/AnnexeDoc.doc","application/msword","signature"
OK,29,fmt/412,"Microsoft Office Open XML - Word","Microsoft Office Open XML - Word",25204,"/home/fajir/docs/cours sur les theories de la motivation.docx","None","signature"

I think the bug comes from "bogus escape: '\x'".
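
The message appears to come from Python's re module rejecting a pattern with a malformed escape, such as \x not followed by two hex digits (newer Python versions word the error differently). One way to narrow this down is to compile each converted pattern separately and list the ones that fail. The helper check_patterns below is a hypothetical sketch, and the second pattern is a deliberately broken example.

```python
import re

def check_patterns(patterns):
    """Compile each regex (given as bytes) and collect the ones Python rejects."""
    problems = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((pattern, str(exc)))
    return problems

problems = check_patterns([
    rb"(?s)\APK\x03\x04",   # valid: every \x is followed by two hex digits
    rb"(?s)\A.{30}\x",      # invalid: truncated \x escape
])
for pattern, message in problems:
    print(pattern, message)
```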

Performance testing

I've done informal performance testing getting 20-60 files per second with the Oct-2010 signature files. This should be done in a proper controlled environment using an established corpus.

Release process

Github holds the source. We also need a place to put the Windows installer. What is the conventional approach?

Fido identifies PowerPoint file as Excel

Dev Effort

0.5D

Description

The following file is misidentified by fido via its container signature: https://drive.google.com/file/d/0B_ULgjJDmvCkRGdzREc1WFB6Ym8/edit?usp=sharing (File => Download will download the original unconverted document.)

fido returns the following ID:

OK,550,fmt/61,"Microsoft Excel 97 Workbook (xls)","BIFF 8 & 8X Workbook (generic)",2706944,"d775b31c-b627-4f7c-908a-9a3502e18e69_Archivematica-0.6-alpha-screenshots.ppt","application/vnd.ms-excel","container"

Whereas DROID returns:

/home/mistydemeo/artefactual/archivematica/src/MCPServer/share/sharedDirectoryStructure/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/maildir-dd365c04-0c91-4bbc-9e4a-b24c27533af1/objects/attachments/Gmail.Sent_Mail/cur/d775b31c-b627-4f7c-908a-9a3502e18e69_Archivematica-0.6-alpha-screenshots.ppt,fmt/126

fmt/126 (Microsoft Powerpoint Presentation (97-XP)) is the correct ID for this file.

Looks like this is a bufsize issue again (cf. #41) - if I increase the buffer size to 1MB, it's identified. Is this just expected behaviour in the default config?

xls format give two results

When I analyze an xls document (http://lecompagnon.info/demos/demoxl1.xls),
fido gives me two results:

[0] => Array
    (
        [result] => OK
        [puid] => fmt/62
        [formatname] => Microsoft Excel 2000-2003 Workbook (xls)
        [version] => 8X
        [signaturename] => BIFF 8 & 8X Workbook (generic)
        [mimetype] => application/vnd.ms-excel
    )

[1] => Array
    (
        [result] => OK
        [puid] => fmt/61
        [formatname] => Microsoft Excel 97 Workbook (xls)
        [version] => 8
        [signaturename] => BIFF 8 & 8X Workbook (generic)
        [mimetype] => application/vnd.ms-excel
    )

When using -zip experiencing errors on a file from OPF Format Corpus

Attempting to scan the opf-format-corpus, I'm seeing an error from a specific file:

pdfCabinetOfHorrors/embedded_video_quicktime.doc

  goatslayer@goatslayer-acer-linux:~/git/opf-format-corpus/format-corpus/pdfCabinetOfHorrors$ fido -zip embedded_video_quicktime.doc
  FIDO v1.3.3 (formats-v84.xml, container-signature-20160121.xml, format_extensions.xml)
  bad repeat interval
  bad repeat interval
  OK,250,fmt/111,"OLE2 Compound Document Format","OLE2 Compound Document Format",26624,"embedded_video_quicktime.doc","None","signature"
  Traceback (most recent call last):
    File "/usr/local/bin/fido", line 9, in <module>
      load_entry_point('opf-fido==1.3.3', 'console_scripts', 'fido')()
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 855, in main
      fido.identify_file(file)
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 400, in identify_file
      self.identify_contents(filename, type=self.container_type(matches))
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 418, in identify_contents
      raise RuntimeError("Unknown container type: " + repr(type))
  RuntimeError: Unknown container type: 'ole'

Distro stats:

Python 2.7.6
No LSB modules are available.
Distributor ID: Ubuntu 
Description:    Ubuntu 14.04.4 LTS
Release:    14.04
Codename:   trusty

My mirror of the OPF Format Corpus can be found here: https://github.com/ross-spencer/opf-format-corpus
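
The traceback shows the scan aborting because no branch handles the 'ole' container type. One possible way to avoid the crash, sketched below, is to look the type up in a table of handlers and skip types that have none. All names here (`identify_contents` as a free function, `handlers`, the returned strings) are hypothetical and do not reflect how fido is actually structured.

```python
def identify_contents(filename, container_type, handlers):
    """Dispatch to a handler for `container_type`, or skip unknown types."""
    handler = handlers.get(container_type)
    if handler is None:
        # Unknown container: report it and carry on instead of raising.
        return "skipped {0}: no handler for container type {1!r}".format(
            filename, container_type)
    return handler(filename)

# Hypothetical handlers for the container types that can be walked.
handlers = {
    "zip": lambda name: "walked zip " + name,
    "tar": lambda name: "walked tar " + name,
}

result = identify_contents("embedded_video_quicktime.doc", "ole", handlers)
print(result)
```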

Matching failed when unzipping

I found that fido was not giving consistent results for the same files when they were stored in a ZIP rather than as a plain file. To reproduce, attempt to identify the contents of the govdocs1 subset0.zip test files.

Adobe Illustrator 14 file identified as PDF 1.5, not AI

This Adobe Illustrator sample is being misidentified in fido 1.3.1 using the PRONOM v70 signatures: https://github.com/artefactual/archivematica-sampledata/raw/master/SampleTransfers/Images/BBhelmet.ai

The file is an Illustrator 14 (CS4) file (fmt/563), but is being identified as PDF 1.5 (fmt/19). This isn't actually wrong per se (since AI files are a superset of PDF), but isn't fully accurate. DROID 6.1.2, using the same v70 signature files, correctly identifies the file as fmt/563.

Cleanup extensions file

The extensions file needs a cleanup as some signatures are in PRONOM and/or in the container signature file.

Review recent code that extends fido XML to include more registry data

Submitted by Andrew Jackson:

To clarify, I added experimental code that pulls more details out of the Format Record and populates an additional 'details' section in the Fido signature file. It looks like this:

<details>
      <dc:description>This is an outline record only, and requires further details, research or authentication to provide information that will enable users to further understand the format and to assess digital preservation risks associated with it if appropriate. If you are able to help by supplying any additional information concerning this entry, please return to the main PRONOM page and select ‘Add an Entry’.</dc:description>
      <dcterms:available />
      <dc:creator />
      <dcterms:publisher />
      <content_type />
      <record_metadata>
        <status>unknown</status>
        <dc:creator>Digital Preservation Department / The National Archives</dc:creator>
        <dcterms:created>11 Mar 2005</dcterms:created>
        <dcterms:modified>02 Aug 2005</dcterms:modified>
        <dc:description />
      </record_metadata>
    </details>

The problem is that it's not clear that this is a good idea, and it may slow down parsing unnecessarily. I am increasingly of the opinion that most of this data should be in a true format registry, and that identification tools should only include a minimal amount of data and refer the user to the registry for this kind of detail.

Having said that, this is not a critical issue for Fido as it only slows things down, at worst.

Original issue: FIDO-2

ascii text

Why is it that fido cannot identify a file with the contents:

Hello world.

Whereas the Unix file utility can?
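
One likely reason is that plain text has no magic bytes for a signature to anchor on, whereas `file` falls back to inspecting the character content. A minimal sketch of that kind of content heuristic follows. It is an illustration only: `looks_like_ascii_text` is a hypothetical helper, not part of fido or of `file`.

```python
def looks_like_ascii_text(data):
    """Heuristic: True if `data` is non-empty and holds only printable
    ASCII bytes plus tab, line feed and carriage return."""
    if not data:
        return False
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    return all(byte in allowed for byte in bytearray(data))

print(looks_like_ascii_text(b"Hello world.\n"))
print(looks_like_ascii_text(b"\x00\x00\x01\xba"))
```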

Format Extensions "Registry", updating through "update_signatures" script

At the moment the "format_extensions.xml" file with advanced signatures or signatures unknown to PRONOM is updated by committing the changed file to the FIDO codebase.

The drawback of this method is that users who add their own signatures to this file risk having their changes overwritten by a new version of FIDO or of the extension file.

Ideally, we should create a GitHub project for the Format Extensions, as a sort of registry, from which the "update_signatures" script pulls the changes.

This way users are able to create pull requests for advanced or unknown signatures to have them added to the Format Extension file.

Additionally, there should be a "user_extensions.xml" file with "special" or "private" signatures which is untouched by any of the update processes.

1.3.3 struggling with zero byte files

Stack trace:

  goatslayer@goatslayer-acer-linux:~/git/droid-sqlite-analysis$ fido empty-file.empty 
  FIDO v1.3.3 (formats-v84.xml, container-signature-20160121.xml, format_extensions.xml)
  FIDO: Zero byte file (empty): Path is: empty-file.empty
  Traceback (most recent call last):
    File "/usr/local/bin/fido", line 9, in <module>
      load_entry_point('opf-fido==1.3.3', 'console_scripts', 'fido')()
    File "/usr/local/lib/python2.7/dist-packages/fido/fido.py", line 855, in main
      fido.identify_file(file)
    File "/usr/local/lib/python2.7/dist-packages/fido/fido.py", line 375, in identify_file
      bofbuffer, eofbuffer = self.get_buffers(f, size, seekable=True)
  ValueError: too many values to unpack

Empty file listing below:

  goatslayer@goatslayer-acer-linux:~/git/droid-sqlite-analysis$ ls -l empty-file.empty 
  -rw-rw-r-- 1 goatslayer goatslayer 0 May 15 12:48 empty-file.empty
  goatslayer@goatslayer-acer-linux:~/git/droid-sqlite-analysis$ 

Distro stats:

Python 2.7.6
No LSB modules are available.
Distributor ID: Ubuntu 
Description:    Ubuntu 14.04.4 LTS
Release:    14.04
Codename:   trusty
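
The unpacking error suggests that the buffer helper does not return exactly two values for an empty file. A sketch of a helper that always returns a `(bof, eof)` pair is shown below. It is an assumption about the shape of a fix, not fido's implementation: the signature is simplified, and the default `bufsize` is an arbitrary illustrative value.

```python
import io

def get_buffers(stream, size, bufsize=128 * 1024):
    """Return (bof, eof) byte buffers; always a 2-tuple, even for empty input."""
    if size == 0:
        return b"", b""
    bof = stream.read(min(size, bufsize))
    if size <= bufsize:
        # Small file: the whole content serves as both buffers.
        return bof, bof
    # Large file: read the last `bufsize` bytes for end-of-file signatures.
    stream.seek(size - bufsize)
    return bof, stream.read(bufsize)

bof, eof = get_buffers(io.BytesIO(b""), 0)
print(repr(bof), repr(eof))
```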

Specific regex fails to parse on some Python installations

Prior to Python 2.7 commit 82219:c1b3d25882ca, the maximum repetition number in a regular expression was 65535. (It's now 4294967294 for 64-bit platforms.) I believe 2.7.5 is the first 2.x series Python with this change; it was also applied to Python 3.2 and 3.3 releases in the last year.

This ends up being a problem because one regular expression in PRONOM, the one for x-fmt/386, actually checks for 65536 repetitions of something:

(?s)\A.{0,0}\x00\x00\x01\xba.{8,12}\x00\x00\x01\xbb.{8,65536}\x00\x00\x01\xb3.{8,128}\x00\x00\x01\xb5

As a result, older Python 2.7.x releases can't compile this regular expression, raising the RuntimeError "invalid SRE code". I encountered this when using FIDO on Ubuntu 12.04, which ships Python 2.7.3. (There's no such problem in recent OS X releases or Ubuntu 14.04.)

This has strange effects on file identification. When scanning a TIFF, I noticed that the results differ between systems where the exception is raised and systems where it is not, even though x-fmt/386 is an MPEG format that should not match the file either way.

Python 2.7.6:

OK,317,fmt/353,"Tagged Image File Format","TIFF generic (little-endian)",28860926,"/Users/vlcice/Downloads/Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","signature"

Python 2.7.3:

OK,40,fmt/152,"Digital Negative Format (DNG)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/153,"Tagged Image File Format for Image Technology (TIFF/IT)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/154,"Tagged Image File Format for Electronic Photography (TIFF/EP)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/155,"Geographic Tagged Image File Format (GeoTIFF)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/156,"Tagged Image File Format for Internet Fax (TIFF-FX)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"

I've scanned through every regex in the current extension file, and x-fmt/386 is the only one that doesn't compile in earlier Pythons.
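
A check like the one described can be sketched as follows. The helper name `find_uncompilable` and the `broken-example` entry are hypothetical; the x-fmt/386 pattern is the one quoted above. On interpreters with the raised repetition limit, that pattern should compile and only the deliberately invalid entry should be reported.

```python
import re

def find_uncompilable(patterns):
    """Return {puid: error message} for patterns that fail to compile."""
    failures = {}
    for puid, pattern in patterns.items():
        try:
            re.compile(pattern)
        except (re.error, OverflowError, RuntimeError) as exc:
            # Older interpreters report repetition limits in different ways,
            # so several exception types are caught here.
            failures[puid] = str(exc)
    return failures

patterns = {
    "x-fmt/386": br"(?s)\A.{0,0}\x00\x00\x01\xba.{8,12}\x00\x00\x01\xbb"
                 br".{8,65536}\x00\x00\x01\xb3.{8,128}\x00\x00\x01\xb5",
    "broken-example": br"(unclosed",  # hypothetical, deliberately invalid
}
print(sorted(find_uncompilable(patterns)))
```

Running this at signature-load time would make the failure visible up front rather than as a change in identification results.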

Delete old versions of PRONOM files?

Older PRONOM updates (eg 0bbf39d) don't keep the old version around, but update and rename the DROID_SignatureFile-V##.xml, formats-v##.xml and pronom-xml-v##.zip. Newer updates to PRONOM (eg #81) have kept the older version around, presumably as a reference/backup. Is there a benefit to keeping the old versions around, or should they be deleted? They're still available in the version history if something goes wrong with the new version, but wouldn't be easily available in a non-development install.

Python 3 support

Dev Effort

10D

Description

Improve Python3 support.

Some initial work was done in #67, but as issues like #78 suggest, it still needs some attention.

To Do:

  • Check for bytes vs str mismatch (eg #67)
  • Fix update_signatures to not use deprecated httplib.HTTP
  • ... other?
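
The bytes vs str item can be illustrated with a short sketch. The pattern is an illustrative little-endian TIFF magic number written as a bytes regex, not a signature copied from fido, and `matches_signature` is a hypothetical helper. The second half shows the kind of `TypeError` Python 3 raises when a str pattern is applied to binary data.

```python
import re

# Signature written as a bytes pattern so it can be matched against binary data.
TIFF_LE = re.compile(br"\AII\x2a\x00")

def matches_signature(buffer):
    """Return True if `buffer` (bytes) starts with the little-endian TIFF magic."""
    return TIFF_LE.match(buffer) is not None

print(matches_signature(b"II*\x00rest of file"))

# On Python 3, mixing types fails: a str pattern cannot be used on bytes.
try:
    re.compile(r"\AII").match(b"II*\x00")
except TypeError as exc:
    print("TypeError:", exc)
```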

Unittests

Currently there are no unit or regression tests. They should be added.

Wrong result for odt inside a zip

When scanning an archive (with the -zip argument) that contains an odt file, fido identifies the odt as ZIP Format (wrong) and analyzes all the objects inside it (jpg, xml, etc.).

Strangely, when I scan the same odt file directly with the -zip argument, fido correctly detects the OpenDocument Text format without looking inside it.

Maybe you could add an argument to scan inside odt containers, but disable it by default when scanning an archive.

Either way, the result ZIP Format is wrong for an OpenDocument file in a zip.

Fix install for linux

The setup.py is not quite right for linux (although the windows installer works fine). The setup.py should probably be up a directory.

Accept content from stdin

As of 0.7, Fido accepts input from files and a list of files from stdin.
Add the ability to accept content from stdin, perhaps when the file list is '-'.

This will allow checking of one file per invocation.
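
A minimal sketch of the '-' convention is shown below, assuming a hypothetical helper `open_input` that is not part of fido. The `stdin` parameter exists only so the example can run against a stand-in stream.

```python
import io
import sys

def open_input(name, stdin=None):
    """Return a binary stream for `name`; '-' means read content from stdin."""
    if name == "-":
        stream = stdin if stdin is not None else sys.stdin
        # On Python 3 the raw bytes are available on sys.stdin.buffer.
        return getattr(stream, "buffer", stream)
    return open(name, "rb")

# Usage with a stand-in for stdin:
fake_stdin = io.BytesIO(b"%PDF-1.5\n")
head = open_input("-", stdin=fake_stdin).read(5)
print(head)
```

Reading stdin as bytes matters because any text-mode translation of line terminators could alter the content before signatures are matched.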
