Comments (5)
Because file (and Tika) have additional heuristic format detection modules specifically for this case. IIRC, the file one works in the same way as the Tika one, and is based on the absence of bytes other than those consistent with ASCII/UTF-8 in the first X bytes of the item.
DROID only has binary signature and container signature modules, and therefore so does Fido. It might be possible to add such a module to Fido as a fallback if binary and container signatures have no match. Of course, this would make the results inconsistent with those from DROID, so it would presumably have to be an opt-in option.
Also, I've noticed that Tika has deliberately decided to make the 'text/plain' detection take precedence over file-extension based identification unless the identified format is known to be a sub-set of 'text/plain'. I'm not sure what the rationale behind this is, but it might be worth finding out in case we've missed something. I suspect it's just because they are interested in making sure the text is extracted reliably, in which case we need not worry as that's not what DROID/Fido is for.
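To make the heuristic described above concrete, here is a minimal sketch of a "no bytes inconsistent with ASCII/UTF-8 in the first X bytes" check, wired in as an opt-in fallback after signature matching. This is not code from file, Tika, DROID or Fido: the names `sniff_text` and `identify_with_fallback`, the 4096-byte sample size, the set of allowed control bytes and the `text_fallback` switch are all assumptions made for illustration.

```python
import codecs
from typing import Optional

# Control bytes that routinely occur in text: tab, LF, form feed, CR.
# (Assumed set; real tools use their own tables.)
TEXT_CONTROL_BYTES = frozenset({0x09, 0x0A, 0x0C, 0x0D})


def sniff_text(data: bytes, sample_size: int = 4096) -> Optional[str]:
    """Return 'us-ascii' or 'utf-8' if the leading bytes look like text, else None."""
    sample = data[:sample_size]
    if not sample:
        return None
    # Any other control byte (including NUL and DEL) suggests binary content.
    if any((b < 0x20 and b not in TEXT_CONTROL_BYTES) or b == 0x7F for b in sample):
        return None
    if all(b < 0x80 for b in sample):
        return "us-ascii"
    # Tolerate a multi-byte sequence cut off by the sample boundary,
    # but not one cut off at the true end of the data.
    at_end = len(data) <= sample_size
    try:
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=at_end)
    except UnicodeDecodeError:
        return None
    return "utf-8"


def identify_with_fallback(data: bytes, signature_match: Optional[str],
                           text_fallback: bool = False) -> Optional[str]:
    """Prefer the signature result; only sniff for text when opted in."""
    if signature_match is not None or not text_fallback:
        return signature_match
    encoding = sniff_text(data)
    return f"Plain Text File ({encoding})" if encoding else None
```

With `text_fallback` left at its default, the result is whatever the signature stage returned, so output would stay consistent with DROID unless the caller opts in.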
from fido.
Thanks @anjackson. I guess the same question could be posed to DROID. It seems kind of wrong that a simple ASCII text file can't be identified. I was running fido on a set of born-digital content, and the results looked like:
??? 6971
WordPerfect for MS-DOS Document 4379
WordPerfect for MS-DOS/Windows Document 1896
Microsoft Word for Windows Macro 76
OLE2 Compound Document Format 26
Plain Text File 25
PrimeOCR 25
Generic Library File 19
Comma Separated Values 17
Adobe Font List 11
MacPaint Image 8
Microsoft Office Owner File 6
Hypertext Markup Language 6
Lotus Notes File 3
Microsoft Word for Macintosh Document 3
PostScript 3
MS-DOS Executable 3
Microsoft Excel Backup 2
Encapsulated PostScript File Format 2
Microsoft Windows Cursor 2
Lotus 1-2-3 Spreadsheet Formatting File 2
Macromedia Director 2
AutoCAD Script 2
dBASE Database 2
Applixware Spreadsheet 2
Freelance File 2
LaTeX (Subdocument) 1
Microsoft Animated Cursor Format 1
Digital Terrain Elevation Data 1
DesignCAD Drawing 1
Windows New Executable 1
Batch file (executable) 1
Statistica Report File 1
Internet Message Format 1
Pagemaker TableEditor Graphics 1
Enigma Binary File (Finale) 1
Rocket Book eBook format 1
Microsoft Word for Windows Document 1
Cascading Style Sheet 1
Vista Pro Graphics 1
Macintosh PICT Image 1
A large number of the ??? results were ASCII text files, which (in theory) should have been lumped in with Plain Text File.
I’m not sure of the current code, but previous versions of fido had an extension mechanism to cover this sort of thing, as well as various issues with PRONOM signatures at the time. DROID compatibility is useful as an option, but doesn’t need to be either the default or the only behaviour. Again, I’m not sure of the current code, but there has always been a need to prioritise matches (e.g., we might prefer XML or Python as types over generic UTF-8), so simply adding one extension to match ASCII and another to match UTF-8 should be pretty easy.
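As a rough illustration of prioritised matching, the sketch below lets generic ASCII and UTF-8 rules act as low-priority catch-alls behind more specific types. It does not reflect fido's actual extension mechanism or signature format: the rule table, the priority numbers and the helper names (`RULES`, `best_match`, `is_ascii_text`, `is_utf8_text`) are hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple


def is_ascii_text(data: bytes) -> bool:
    """True if every byte is printable ASCII, tab, LF or CR."""
    return bool(data) and all(b in (9, 10, 13) or 32 <= b < 127 for b in data)


def is_utf8_text(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 (a deliberately loose check)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return bool(data)


XML_DECL = re.compile(rb"\A\s*<\?xml\b")
PYTHON_SHEBANG = re.compile(rb"\A#!.*\bpython")

# (format name, priority, matcher); a higher priority wins.
# The values are illustrative, not taken from PRONOM or fido.
RULES: List[Tuple[str, int, Callable[[bytes], bool]]] = [
    ("XML", 20, lambda d: XML_DECL.search(d) is not None),
    ("Python script", 20, lambda d: PYTHON_SHEBANG.search(d) is not None),
    ("Plain text (ASCII)", 10, is_ascii_text),
    ("Plain text (UTF-8)", 5, is_utf8_text),
]


def best_match(data: bytes) -> Optional[str]:
    """Return the highest-priority matching format name, or None."""
    hits = [(priority, name) for name, priority, matches in RULES if matches(data)]
    if not hits:
        return None
    # max() keeps the first of equally ranked hits, so table order breaks ties.
    return max(hits, key=lambda hit: hit[0])[1]
```

An XML file made only of ASCII bytes matches three rules here (XML, ASCII and UTF-8), and the priority ordering is what makes it report as XML rather than plain text.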
DROID does use extension-based matching, but I'm not sure how good the coverage is (there are an awful lot of text/plain types after all). This (along with the fact that DROID is a bit too strict with messy formats like HTML) is why I tend to use Tika and DROID in combination.
Which reminds me that none of these tools are very good at plain text formats unless the extension is correct. Reliably detecting CSS, JS, CSV, etc. is still painful.
Closing, as this seems to be a new feature request for text matching that's unlikely to be developed.