j256 / simplemagic

Simple file magic number and content-type library which provides mime-type determination from files and byte arrays

Home Page: http://256stuff.com/sources/simplemagic/

License: ISC License

Shell 0.96% Makefile 0.13% Java 91.72% HTML 0.09% Perl 0.01% Rich Text Format 7.09%

magic java mime unix

simplemagic's Introduction

Java Simple Magic

Here's a "magic" number package which allows content-type (mime-type) determination from files and byte arrays. It makes use of the magic(5) Unix content-type files to implement the same functionality as the Unix file(1) command in Java which detects the contents of a file. It uses either internal config files or can read /etc/magic, /usr/share/file/magic, or other magic(5) files and determine file content from File, InputStream, or byte[].

For more information, visit the home page.
The source code can be found in the git repository.
Maven packages are published via

Enjoy. Gray Watson

Getting Started

To get started you use the SimpleMagic package like the following:

// create a magic utility using the internal magic file
ContentInfoUtil util = new ContentInfoUtil();
// if you want to use a different config file(s), you can load them by hand:
// ContentInfoUtil util = new ContentInfoUtil("/etc/magic");
// ...
ContentInfo info = util.findMatch("/tmp/upload.tmp");
// or
ContentInfo info = util.findMatch(inputStream);
// or
ContentInfo info = util.findMatch(contentByteArray);

Once you have the ContentInfo it provides:

Enumerated type if the type is common
Approximate content-name
Full message produced by the magic file
Mime-type string if one is configured by the config file
Associated file extensions (if any)

For example:

HTML, mime 'text/html', msg 'HTML document text'
Java, msg 'Java serialization data, version 5'
PDF, mime 'application/pdf', msg 'PDF document, version 1.4'
gzip, mime 'application/x-gzip', msg 'gzip compressed data, was "", from Unix...'
GIF, mime 'image/gif', msg 'GIF image data, version 89a, 16 x 16'
PNG, mime 'image/png', msg 'PNG image, 600 x 371, 8-bit/color RGB, non-interlaced'
ISO, mime 'audio/mp4', msg 'ISO Media, MPEG v4 system, iTunes AAC-LC'
Microsoft, mime 'application/msword', msg 'Microsoft Word Document'
RIFF, mime 'audio/x-wav', msg 'RIFF (little-endian) data, WAVE audio, Microsoft...'
JPEG, mime 'image/jpeg', msg 'JPEG image data, JFIF standard 1.01'

Maven Configuration

Maven packages are published via

<dependency>
	<groupId>com.j256.simplemagic</groupId>
	<artifactId>simplemagic</artifactId>
	<version>1.17</version>
</dependency>

ChangeLog Release Notes

See the ChangeLog.txt file.

simplemagic's Issues

Support for empty files

ContentInfoUtil#findMatch returns null for an empty file, and it also returns null when the file's mime type is not found.

Would it be possible to add support for differentiating between the two cases?

ArrayOutOfBoundsException on exe

I'm attempting to get the content type of an exe file (putty.exe) and it is throwing an ArrayIndexOutOfBoundsException. I can't upload putty.exe as it is not supported. It is downloadable here: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe

2017-08-23 16:46:29 ERROR 1503521189293246 RequestUtils.java:994 Simple Magic error during detection
java.lang.ArrayIndexOutOfBoundsException: -5
at com.j256.simplemagic.endian.LittleEndianConverter.convertNumber(LittleEndianConverter.java:40)
at com.j256.simplemagic.endian.LittleEndianConverter.convertNumber(LittleEndianConverter.java:16)
at com.j256.simplemagic.types.NumberType.extractValueFromBytes(NumberType.java:48)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:140)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:181)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:181)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:66)
at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:128)
at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:122)
at com.j256.simplemagic.ContentInfoUtil.findMatch(ContentInfoUtil.java:258)

I am calling this using:
private static final ContentInfoUtil CONTENT_TYPE_DETECTOR_SIMPLE_MAGIC = new ContentInfoUtil();
ContentInfo info = CONTENT_TYPE_DETECTOR_SIMPLE_MAGIC.findMatch(fileItem.get());

I am using version 1.12

ArrayIndexOutOfBoundsException in com.j256.simplemagic.types.StringType.charFromByte

Stack trace:

    java.lang.ArrayIndexOutOfBoundsException: -556514324

at com.j256.simplemagic.types.StringType.charFromByte(StringType.java:184)
at com.j256.simplemagic.types.StringType.findOffsetMatch(StringType.java:124)
at com.j256.simplemagic.types.StringType.isMatch(StringType.java:88)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:146)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:181)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:181)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:66)
at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:128)
at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:122)
at com.j256.simplemagic.ContentInfoUtil.findMatch(ContentInfoUtil.java:258)

Minimal example:

public void test() {
    byte[] x = {
        77, 90, -19, -22, 26, -86, -36, 125, 81, 56, 92, 49,
        -27, 85, 125, 34, 88, -103, -55, 58, -21, 23, 8, 48,
        -11, 121, 85, -26, -30, 45, 3, 39, -122, -68, 87, -26,
        23, -15, -117, 104, 76, -100, -51, 36, -100, 29, 42, -69,
        -8, 56, -51, -85, 6, -36, -118, -101, -86, -68, -22, -98,
        -20, 67, -44, -34, -62, 37, -38, -12, -30
    };

    ContentInfoUtil detector = new ContentInfoUtil();

    detector.findMatch(x);
}

I think the issue is in MagicEntry.OffsetInfo.getOffset. When val is cast to an integer, the result is a negative value. This value ends up being used as an offset into an array.
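The failure mode described here can be reproduced in isolation. The following is a minimal sketch and not SimpleMagic's actual code: the class `OffsetDemo`, the helper `toIndex`, and the sample value are all hypothetical. It shows how narrowing a large long to an int produces a negative number, and one possible way to reject out-of-range offsets before they are used as an array index.

```java
final class OffsetDemo {

    /**
     * Converts an offset computed from file data into a safe array index.
     * Returns -1 when the offset does not fall inside the array, so that a
     * caller can treat the entry as "no match" instead of throwing.
     */
    static int toIndex(long offset, int arrayLength) {
        if (offset < 0 || offset >= arrayLength) {
            return -1;
        }
        // Safe narrowing: 0 <= offset < arrayLength <= Integer.MAX_VALUE.
        return (int) offset;
    }

    public static void main(String[] args) {
        long val = 0xF0000000L;      // hypothetical unsigned 32-bit value read from the data
        int narrowed = (int) val;    // plain narrowing wraps around to a negative number
        System.out.println(narrowed < 0);
        System.out.println(toIndex(val, 69));
        System.out.println(toIndex(10, 69));
    }
}
```

Checking the range while the value is still a long avoids the wrap-around entirely, which is why the check comes before the cast.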

Excel and Powerpoint 97-2003 files return null mime type

I searched my magic file and found a few suspect entries for Excel, but found nothing for Powerpoint. Then I grep'd the source code of file-5.16 and found this in readcdf.c:

} app2mime[] =  {
    { "Word",           "msword",       },
    { "Excel",          "vnd.ms-excel",     },
    { "Powerpoint",         "vnd.ms-powerpoint",    },
    { "Crystal Reports",        "x-rpt",        },
    { "Advanced Installer",     "vnd.ms-msi",       },
    { "InstallShield",      "vnd.ms-msi",       },
    { "Microsoft Patch Compiler",   "vnd.ms-msi",       },
    { "NAnt",           "vnd.ms-msi",       },
    { "Windows Installer",      "vnd.ms-msi",       },
    { NULL,             NULL,           },
},

Looks like in version 5.x of file, they use CDF to parse MS Office documents instead of magic numbers.

The major changes for 5.x are CDF file parsing, indirect magic, and
overhaul in mime and ascii encoding handling.

src/cdf.c - parser for Microsoft Compound Document Files
src/readcdf.c - CDF wrapper.

Support for audio/amr

What would I need to do to add support for "audio/amr" as a file type?

This file type can come from an audio text sent from certain Android phones (I think this is where it comes from).

You can see it listed here: https://www.twilio.com/docs/sms/accepted-mime-types and the product I work on does get these from time-to-time.
This type is also listed in /etc/mime.types on my linux system.

I'll gladly make the change/PR but I wanted guidance on what specifically to change besides adding it to the magic file.

Thank you for your time.

Partial Match happening for Jpeg Files

Hi Gray,
Since upgrading SimpleMagic to 1.10, JPEG images are only partially matched. Is this change intentional?

Sample Images:

Also, I noticed that the two images have a common trait: both contain TIFF image data.

$> file a.jpeg 
a.jpeg: JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=6, software=Aviary for Android 4.4.6, orientation=[*0*], model=SM-A500G, datetime=2016:02:20 12:50:44, manufacturer=SAMSUNG], baseline, precision 8, 530x714, frames 3

$> file b.jpeg 
b.jpeg: JPEG image data, Exif standard: [TIFF image data, little-endian, direntries=0], baseline, precision 8, 1800x2400, frames 3

P.S. This is the same user as @se7en007.

bug: Weird identification of heic image file

Example:
http://nokiatech.github.io/heif/content/images/autumn_1440x960.heic

autumn_1440x960.zip

got from:
http://nokiatech.github.io/heif/examples.html

What I get:

info:ISO, type OTHER, msg 'ISO Media'

How to use external magic db?

Hi,
I've been using this lib for some months and it's very useful, but the problem I see is updating its magic db. I tried to use the files from https://github.com/file/file/tree/master/magic/Magdir, as reported in an issue from @stokito, concatenating all the files into one big txt file. But for some strange reason, the resulting db does not correctly detect the zip format: it gives me "(0x%x)" as the value when I call the method "getName()", while the built-in db correctly gives "zip". Does anyone else have the same problem, or know a way to load an updated version of the magic db?
Thanks

Questions about compatibility with Android

From my understanding by reading the description, this library can be used to guess the type of a file based on its content, similar to the "file" command on linux.

Am I correct?
Could this also work on Android? If so, from which version?
Is it possible to use it via Gradle? If so, what is the dependency I need to add? Is there a way to get updated about the version ? Maybe here?
How much space does it add to the app's size?
Where can I see all of the supported mime-types that it can return? Does it have support for those that are supported by Android (here) ?

problems matching Windows executables

Hi,

I am writing a file analyser in Java and plan to use your simplemagic library to determine the file type of an input stream. To make it as convenient to use as possible, I compiled it into a jar file that includes your magic.gz file.

Unfortunately, the findMatch method always returns null for input streams of every type. I tried Windows executables, bitmaps, plain text, ...
Is there an error in the current version of your code, or am I just calling it the wrong way?

This is my test code:

        ContentInfoUtil util = new ContentInfoUtil();
        ContentInfo info = util.findMatch(new FileInputStream(new File("C:\\test\\bmp.bmp")));
        System.out.println(info != null ? info.getMessage() : "Is null");

You can see the folder structure of the simplemagic.jar file attached.
The com folder includes the class files of your code. The res folder includes the magic.gz dictionary file.

Thank you for your help.

Null testValue causing exception

Hello,

I'm getting an NPE when using magic files installed after making and installing file from source. Here's how I construct the ContentInfoUtil:

ContentInfoUtil util = new ContentInfoUtil("/usr/local/share/misc/magic");

This is the error it generates:

java.lang.NullPointerException
	at com.j256.simplemagic.types.StringType.getStartingBytes(StringType.java:97)
	at com.j256.simplemagic.entries.MagicEntry.getStartsWithByte(MagicEntry.java:96)
	at com.j256.simplemagic.entries.MagicEntries.optimizeFirstBytes(MagicEntries.java:90)
	at com.j256.simplemagic.ContentInfoUtil.readEntriesFromFile(ContentInfoUtil.java:341)
	at com.j256.simplemagic.ContentInfoUtil.<init>(ContentInfoUtil.java:156)
	at com.j256.simplemagic.ContentInfoUtil.<init>(ContentInfoUtil.java:103)

The issue seems to be that testValue is null on this line:

return ((TestInfo) testValue).getStartingBytes();

This happens because a lot of rules have an 'x' for the test value. The testValue is set to null in MagicEntryParser:

// process the test-string
Object testValue;
String testStr = parts[2];
if (testStr.equals("x")) {
	testValue = null;

I admit to not fully auditing the code. I just wanted to run it by you to make sure my problem isn't something stupid that I'm doing. I figured the official file magic should work but there are a lot of entries with null test values. Any ideas? Thanks.

mimeType = null for some mpg files

The attached file has mimetype='video/mpeg', but simplemagic returned null.

trn_over_under_adj_2000.zip

Not working for txt file

Hi, I'm trying the library like this:

ContentInfoUtil contentInfoUtil = new ContentInfoUtil("/home/boris/.magic.mime");
ContentInfo info = contentInfoUtil.findMatch(path.toFile());

path is just a Path to a small text file named /tmp/PFZ6827520388559156905.txt with contents some text. If I'm using the built-in magic.mime file (i.e. I remove the argument to the constructor), info is null. If I use my magic.mime file (which I got from cloning https://github.com/file/file/ and concatenating all files in magic/Magdir), info is not null, but rather: NES, type OTHER, msg '(NES 2.0): 32x16k PRG, 116x8k CHR [V-mirror] [Trainer]' (that's the toString), and the mime type is null.

I'm not sure what's going on and what I'm doing wrong. It also doesn't work for .ods file (application/vnd.oasis.opendocument.spreadsheet), .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document, it returns application/zip), etc. For all of these the file command works fine. Where am I making a mistake? Isn't this library supposed to return the same things as the file command?

AWS cloudfront image return as content type html

Hi,

It looks like the file on AWS S3 is returned with content type jpeg. The same file on AWS CloudFront is returned with content type html.
Can you please take a look?

Cannot build from command line using mvn clean verify due to bad javadoc

Javadoc is pretty bad here. Can't build from commandline.

Support for Handlebars Template File

Hi,

As far as I can tell, the ContentType enum has no support for Handlebars template files with the "hbs" extension and the "text/x-handlebars-template" mime type. This should be added to the enum map.

Thank you very much.

Handling of BOM leading characters

From @yongminyan .

Hey @j256, I found these issues when parsing HTML content that starts with a BOM, such as the byte array "-17, -69, -65, 60, 104, 116, 109, 108, 32" (the first three bytes are the UTF-8 BOM, followed by an <html tag) or "-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0" (the first two bytes are the UTF-16 little-endian BOM, followed by an <html tag). In these cases the library fails to detect the content as text/html. For this to work, I think we need to fix the issues first and then add proper magic entries, something like

+0      byte 0xEF               
+!:mime text/html
+>1     byte 0xBB               
+>>2    byte 0xBF               UTF-8 Unicode text with BOM
+>>>3   search/1/cb \<html

and

+# UTF-16 LE
+0      byte 0xFF               
+!:mime text/html
+>1     byte 0xFE               
+>>1    lestring16 \<html                Little-endian UTF-16 Unicode text with BOM

I did not include the magic entries in the pull request because I feel those changes are not very generic: the same thing could happen to other types such as xml (i.e., with a different encoding). I am not sure what the best solution is.

Also, I am not sure whether lestring16/bestring16 support the [Bbc] options. The magic(5) spec does not say so, but I see that lestring16/bestring16 extend from StringTypes. In other words, can we do something like lestring16/cb?

It would be great if you can take a look and answer my two questions above, thanks a lot!
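As a complement to the magic entries proposed in this issue, a caller could also detect and strip a byte order mark before handing the bytes to a detector. The sketch below is only an illustration under that assumption: `BomSniffer`, `bomLength`, and `stripBom` are hypothetical names and are not part of SimpleMagic. It ignores UTF-32, whose little-endian BOM also starts with FF FE.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BomSniffer {

    /** Returns the length in bytes of a leading UTF-8 or UTF-16 BOM, or 0 if there is none. */
    static int bomLength(byte[] bytes) {
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            return 3; // UTF-8
        }
        if (bytes.length >= 2) {
            int first = bytes[0] & 0xFF;
            int second = bytes[1] & 0xFF;
            if ((first == 0xFF && second == 0xFE) || (first == 0xFE && second == 0xFF)) {
                return 2; // UTF-16 little-endian or big-endian
            }
        }
        return 0;
    }

    /** Returns a copy of the input without its leading BOM, if any. */
    static byte[] stripBom(byte[] bytes) {
        return Arrays.copyOfRange(bytes, bomLength(bytes), bytes.length);
    }

    public static void main(String[] args) {
        // The UTF-8 example from the issue: BOM followed by "<html ".
        byte[] utf8Html = {-17, -69, -65, 60, 104, 116, 109, 108, 32};
        System.out.println(bomLength(utf8Html));
        System.out.println(new String(stripBom(utf8Html), StandardCharsets.US_ASCII));
    }
}
```

Stripping the BOM only helps directly in the UTF-8 case. For UTF-16 the remaining bytes are still two-byte code units, so a byte-oriented string test for <html would still need a 16-bit string type such as lestring16.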

Possible bug inside of MagicEntryParser

com/j256/simplemagic/entries/MagicEntryParser.java:438

			if ("-".equals(offsetOperator)) {
				add = -add;
			} else if ("-".equals(offsetOperator)) {
				offset = add;
				add = 0;
			}

The condition "-".equals(offsetOperator) is duplicated here, so the second branch can never run.

ContentInfoUtil.findMatch([77, 90]) generates NPE

We're using j256/simplemagic version 1.14 in a project and we noticed an NPE with the following call stack when attempting to find a match on a 2-byte prefix where the byte array is [77, 90] (hex [4d, 5a]):

java.lang.NullPointerException
        at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:134) ~[simplemagic-1.14.jar:?]
        at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:181) ~[simplemagic-1.14.jar:?]
        at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:66) ~[simplemagic-1.14.jar:?]
        at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:128) ~[simplemagic-1.14.jar:?]
        at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:122) ~[simplemagic-1.14.jar:?]
        at com.j256.simplemagic.ContentInfoUtil.findMatch(ContentInfoUtil.java:253) ~[simplemagic-1.14.jar:?]

This file is a 2-byte truncated windows executable.

Reading magic entries from system directory reads only the first file

Reading entries from a system directory reads only the first file and then stops, because the loop returns from inside its first iteration. This is rather strange behavior:

for (File subFile : fileOrDirectory.listFiles()) {
            FileReader reader = null;
            try {
                reader = new FileReader(subFile);
                return readEntries(reader);
            } catch (IOException e) {
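The general pattern for avoiding this kind of early return is to accumulate results across the loop and return once at the end. The sketch below illustrates the pattern on plain text lines. It is not SimpleMagic's code, and `ReadAllMagicFiles` and `readAllLines` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ReadAllMagicFiles {

    /** Reads the lines of every regular file in the directory, not just the first one. */
    static List<String> readAllLines(File directory) throws IOException {
        File[] subFiles = directory.listFiles();
        if (subFiles == null) {
            // listFiles() returns null if the path is not a directory or cannot be read.
            throw new IOException("Cannot list " + directory);
        }
        List<String> lines = new ArrayList<>();
        for (File subFile : subFiles) {
            if (!subFile.isFile()) {
                continue;
            }
            try (BufferedReader reader = new BufferedReader(new FileReader(subFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            // No return here: continue so that the remaining files are read as well.
        }
        return lines;
    }
}
```

The try-with-resources statement also closes each reader, which the original snippet handles with a separate reader variable initialized to null.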

Doesn't recognize bitmap files exported from GIMP

When I export images from GIMP as bitmaps, the magic number for those files is not recognized. When I run the file through xxd, I get:

00000000: 424d 7a75 0200 0000 0000 7a04 0000 6c00 BMzu......z...l.
00000010: 0000 9001 0000 9001 0000 0100 0800 0000 ................
00000020: 0000 0071 0200 232e 0000 232e 0000 0001 ...q..#...#.....
00000030: 0000 0001 0000 4247 5273 0000 0000 0000 ......BGRs......

Which does start with the 424d, but it fails to be recognized as a bitmap.

findMatch() reads too many bytes

Here is a source of the com.j256.simplemagic.ContentInfoUtil#findMatch(java.io.File) method:

	/**
	 * Number of bytes that the utility class by default reads to determine the content type information.
	 */
	public final static int DEFAULT_READ_SIZE = 10 * 1024;
	private int fileReadSize = DEFAULT_READ_SIZE;

	public ContentInfo findMatch(File file) throws IOException {
		int readSize = fileReadSize;
		if (file.length() < readSize) {
			readSize = (int) file.length();
		}
		if (readSize == 0) {
			return ContentInfo.EMPTY_INFO;
		}
		byte[] bytes = new byte[readSize];
		FileInputStream fis = null;
		try {
			fis = new FileInputStream(file);
			fis.read(bytes);
		} finally {
			closeQuietly(fis);
		}
		return findMatch(bytes);
	}

Some things are unclear to me:

Why is DEFAULT_READ_SIZE so big? 10 kibibytes looks like too much when most files have only a few magic bytes.
Derived from the first question: what is the reason to change fileReadSize?

BTW, the result of fis.read() is ignored; I'm not sure this is safe.
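On the last point: InputStream.read(byte[]) is allowed to return fewer bytes than the buffer holds, so a single call does not guarantee a full buffer. A common remedy is to loop until the buffer is full or the stream ends. The following is a minimal sketch of that loop; `ReadFully` and `readUpTo` are hypothetical names, not part of the library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class ReadFully {

    /**
     * Reads up to maxBytes from the stream, looping because a single read()
     * may return fewer bytes than requested. The returned array is trimmed
     * to the number of bytes actually read.
     */
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes) {
            int n = in.read(buffer, total, maxBytes - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total == maxBytes ? buffer : Arrays.copyOf(buffer, total);
    }
}
```

Trimming the array matters for detection, because trailing zero bytes from an underfilled buffer could otherwise be mistaken for file content.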

Not recognizing Office 2007+ files (docx, xlsx,...)

Hi, I found this project from your comment on the article http://www.rgagnon.com/javadetails/java-0487.html. I have used the UNIX "file" command with good accuracy, so seeing that the simplemagic library is based on the same logic appealed to me. Unfortunately this Java library doesn't have the same success rate. Particularly, it fails on most MS Office files from Office 2007+.

Here is what I get from SimpleMagic:

Word2007.docx:    application/zip [Zip archive data, at least v2.0 to extract]
Word97-2003.doc:  application/msword [Microsoft Word Document]
Excel2007.xlsx:   application/zip [Zip archive data, at least v2.0 to extract]
Excel97-2003.xls: null [OLE 2 Compound Document]

Here is what I expected using "file"

$ file --mime-type Word* Excel*
Word2007.docx:    application/vnd.openxmlformats-officedocument.wordprocessingml.document
Word97-2003.doc:  application/msword
Excel2007.xlsx:   application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Excel97-2003.xls: application/vnd.ms-excel

My system:

$ file --v
file-5.13
magic file from /usr/share/misc/magic

$ uname -a
CYGWIN_NT-6.1 XXXX 1.7.25(0.270/5/3) 2013-08-31 20:37 x86_64 Cygwin

I thought it might be due to an older magic file, but unfortunately, using my system's magic file doesn't help much (actually makes it worse). What version of file/magic was used here? Perhaps the file format of magic changed since then?

Docx file created by LibreOffice returns Zip mime type

I think it's because LibreOffice (OpenOffice) puts files into the docx archive in a different order than MS Word.
So the file has a different signature and is detected as a plain zip archive.

Example file is attached.
DocxByLibreOffice.docx

svg is not recognized

I used this svg file https://dev.w3.org/SVG/tools/svgweb/samples/svg-files/410.svg

    @Test
    public void testSvgBug() throws IOException {
        try (InputStream resource = getClass().getResourceAsStream("410.svg")) {
            byte[] byteArray = ByteStreams.toByteArray(resource);
            ContentInfoUtil contentInfoUtil = new ContentInfoUtil();
            ContentInfo match = contentInfoUtil.findMatch(byteArray);
            assertThat(match.getContentType()).isEqualTo(ContentType.SVG);
        }
    }

Seems like there is some issue in the magic.gz file

>>23    search/400      \<svg                   SVG Scalable Vector Graphics image
!:mime  image/svg+xml

As a result, svg is not added to com.j256.simplemagic.entries.MagicEntries#entryList :/

SVG will be detected as XML

The detection of SVG files is not correct. The utility identifies the type as XML, due to the XML header within SVG files. A possible solution is to check what comes after the XML header. There should be an "svg" tag.
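The suggested check can be sketched as a small heuristic. This is an illustration of the idea only, not how SimpleMagic or file(1) detects SVG: `SvgSniffer` and `looksLikeSvg` are hypothetical names, the input is assumed to be UTF-8 without a BOM, and a DOCTYPE with an internal subset is not handled.

```java
import java.nio.charset.StandardCharsets;

final class SvgSniffer {

    /** Heuristic: true if the first element after any XML prolog starts with <svg. */
    static boolean looksLikeSvg(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8).trim();
        // Skip the XML declaration, if present.
        if (text.startsWith("<?xml")) {
            int end = text.indexOf("?>");
            if (end < 0) {
                return false;
            }
            text = text.substring(end + 2).trim();
        }
        // Skip comments and a simple DOCTYPE (one without an internal subset).
        while (text.startsWith("<!")) {
            boolean comment = text.startsWith("<!--");
            int end = comment ? text.indexOf("-->") : text.indexOf('>');
            if (end < 0) {
                return false;
            }
            text = text.substring(end + (comment ? 3 : 1)).trim();
        }
        return text.startsWith("<svg");
    }
}
```

A caller could run such a check only after the detector has already reported XML, so that the more specific type refines the generic one.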

Bash script not recognized correctly

A simple bash script with this header:

#!/usr/bin/env bash

will result in a ContentInfo with

name: a
contentType: OTHER
mimeType: null
message: a b script text executable

whereas command line file says Bourne-Again shell script, ASCII text executable

I used the same magic file for both simplemagic and file (/usr/share/file/magic).

webm being incorrectly read as application/octet-stream

Hi,
It seems that simplemagic fails to detect webm files and returns ContentType OTHER, because the mime type it reads is "application/octet-stream" instead of "video/webm".

Sample webm file for which it failed:
http://techslides.com/demos/sample-videos/small.webm

MagicEntry parser doesn't handle octal and hex the same way as the "file" command.

The "man 5 magic" manual entry is unclear (to say the least) about how escapes are treated. However, if you look at the source code for the "file" command, the treatment is as follows:

an octal escape is "\" followed by 1, 2 or 3 octal digits, and
a hex escape is "\x" followed by 1 or 2 hex digits.

That's not what simplemagic does. It actually expects exactly 3 octal digits (or "\0" as a special case) and exactly 2 hex digits.

(This was determined by printing MagicEntry objects, and confirmed by source code examination.)
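The variable-length rules described in this issue can be sketched as a small decoder. This is an illustration only, not the parser used by SimpleMagic or by file(1): `MagicEscapes`, `decode`, and `hexValue` are hypothetical names. As simplifications, named escapes such as \n are not translated, a "\x" with no hex digits is kept as a literal 'x', and input characters are assumed to fit in one byte.

```java
import java.io.ByteArrayOutputStream;

final class MagicEscapes {

    /** Returns the value of an ASCII hex digit, or -1 if the character is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes \ooo (1 to 3 octal digits) and \xhh (1 or 2 hex digits). */
    static byte[] decode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i == s.length()) {
                out.write(c); // ordinary character, or a trailing backslash
                continue;
            }
            char next = s.charAt(i);
            int value = 0;
            int digits = 0;
            if (next >= '0' && next <= '7') {
                // Octal: consume between one and three digits.
                while (i < s.length() && digits < 3
                        && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                    value = value * 8 + (s.charAt(i++) - '0');
                    digits++;
                }
                out.write(value);
            } else if (next == 'x') {
                // Hex: consume one or two digits after the 'x'.
                int j = i + 1;
                while (j < s.length() && digits < 2 && hexValue(s.charAt(j)) >= 0) {
                    value = value * 16 + hexValue(s.charAt(j++));
                    digits++;
                }
                if (digits == 0) {
                    out.write('x'); // no digits: keep the 'x' literally
                    i++;
                } else {
                    out.write(value);
                    i = j;
                }
            } else {
                out.write(next); // simplification: the escaped character stands for itself
                i++;
            }
        }
        return out.toByteArray();
    }
}
```

With these rules, "\0", "\12", and "\x9" each decode to a single byte, which a parser that insists on exactly three octal or two hex digits would not accept.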

Can't read the correct content type for .xlsx files (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)

When I try to get the content type for the uploaded file test.xlsx, I get the wrong content type:

ContentInfoUtil contentInfoUtil = new ContentInfoUtil();
ContentInfo contentInfo = contentInfoUtil.findMatch("D:/test.xlsx");
System.out.println(contentInfo.getMimeType());
System.out.println(contentInfo.getContentType());

I got this output

application/vnd.openxmlformats-officedocument
MICROSOFT_OFFICE

and it should be

application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
MICROSOFT_EXCEL_XML

I attached the excel file that has this problem

Note: I tested it on version 1.16 and 1.17

test.xlsx

Webp are getting partial matched to Riff Images

From google docs, webp files are stored in RIFF container. If I run simplemagic on webp files, it gives There was a partial match, RIFF, type OTHER, msg 'RIFF (little-endian) data'.

1472812600656_679_92337e55-d140-4da1-8c79-94582c05ad61_540x360.webp.zip

Upgrade built-in magic file and use external on Linux

From what I see, the magic.gz file included in simplemagic.jar is outdated.
It was originally copied from CentOS, and it looks like you are updating it manually instead of copying it from a fresh /etc/magic.
Moreover, the /etc/magic file is no longer present on current Linux distributions.
The MIME db file is part of the file utility, which migrated to libmagic and changed the directory layout.
The MIME types DB now lives in the magic/Magdir directory of the file sources and is compiled into the /usr/share/file/magic.mgc file.
Maybe MacOS or some FreeBSD versions still have the /etc/magic file, but I don't think so.

So, here are a few things:

It doesn't make sense to mention the /etc/magic file anymore, or we should at least mention that it may be absent.
It would be great to parse the compiled magic.mgc file instead and replace the internal magic.gz with magic.mgc.
Even if we can't parse the compiled magic.mgc, we can at least update the magic db from the actual magic/Magdir, which is constantly updated.
I tried to use the archive and compress files from magic/Magdir, but some entries failed to parse. This means that their format is more advanced and simplemagic should be adjusted.
The magic.mgc file is more than 5 MB, which is quite a lot, so it is worth considering #63.

.xls file shows null in return of getMimeType

Hi, @j256 told me to put this issue here. I use this code to get the mime type of an Excel file, but mimeType is null in the ContentInfo. I can get the file format from the message and name properties, but I really need the mimeType
and extension. I've attached the file so you can test it too.

// try-with-resources closes the stream even if detection throws
try (InputStream stream = new FileInputStream(file)) {
   ContentInfoUtil util = new ContentInfoUtil();
   ContentInfo info = util.findMatch(stream);

   if (info != null) {
       String mimeType1 = info.getMimeType();
   }
} catch (Exception e) {
   e.printStackTrace();
}

1.xls

XLS and CSV files not recognized

Hello,

I tried to recognize several mime types, including XLS and CSV. Here are the results I have:

csv - NULL
doc - application/msword
docx - application/vnd.openxmlformats-officedocument.wordprocessingml.document
xls - NULL
xlsx - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
xml - application/xml

Is it a bug that CSV & XLS files are not recognized? Can I do something to recognize those mime types?

Thanks!

Request: add nullable annotations, to know what can return null

For example, I've noticed there are cases where "findMatch" can return null:

private ContentInfo findMatch(byte[] bytes, List<MagicEntry> entryList) {
    ....
    if (partialMatchInfo == null) {
        logger.trace("returning no match");
        return null;

Slim version without internal config files

I see that the lib contains some DB files with magic entries, but I need a slim version of the dependency that uses only the system magic file. Is that possible?

Support for detecting Illustrator files

Would it be possible to add support for Illustrator files?

This seems to show that it's possible: https://asecuritysite.com/forensics/magic

pcapng filetype support

Hi, here is a pcapng file that simplemagic detects as xml.
file_pcapng.zip

NegativeArraySizeException in PStringType

This test:

// Trigger (failed match for) the Digital Symphony
// sound sample (RISC OS) pattern:
byte[] x = new byte[] {
    0x2, 0x1, 0x13, 0x13, 0x13, 0x01, 0x0d, 0x10,
    0x1,
    // pstring: first byte is length. Java has no
    // unsigned bytes, so to get a length of 255,
    // use -1.
    -1, 0x1, 0x1, 0x1
};

ContentInfoUtil util = new ContentInfoUtil();
util.findMatch(x);

generates this exception:

java.lang.NegativeArraySizeException
at com.j256.simplemagic.types.PStringType.extractValueFromBytes(PStringType.java:23)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:149)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:185)
at com.j256.simplemagic.entries.MagicEntry.matchBytes(MagicEntry.java:68)
at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:150)
at com.j256.simplemagic.entries.MagicEntries.findMatch(MagicEntries.java:138)
at com.j256.simplemagic.ContentInfoUtil.findMatch(ContentInfoUtil.java:207)

I think the issue is that in PStringType, the string length is read as a signed byte.
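The standard Java idiom for reading a byte as an unsigned value is to mask it with 0xFF. The sketch below applies that idiom to a Pascal-style string. It is not the library's PStringType, and `PStringLength`, `unsignedLength`, and `extract` are hypothetical names.

```java
import java.util.Arrays;

final class PStringLength {

    /** Interprets the length byte of a Pascal-style string as unsigned (0 to 255). */
    static int unsignedLength(byte lengthByte) {
        return lengthByte & 0xFF;
    }

    /**
     * Extracts the string bytes that follow the length byte at the given offset.
     * Returns null when the data is too short, instead of throwing.
     */
    static byte[] extract(byte[] bytes, int offset) {
        if (offset < 0 || offset >= bytes.length) {
            return null;
        }
        int length = unsignedLength(bytes[offset]);
        int start = offset + 1;
        if (start + length > bytes.length) {
            return null; // declared length runs past the end of the data
        }
        return Arrays.copyOfRange(bytes, start, start + length);
    }
}
```

With the mask in place, a length byte of -1 becomes 255, so the array size can no longer be negative and the remaining check is only whether enough data is available.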

System language dependant test failure

The tests in the current version depend on the system language: with a Spanish locale, for example, decimal numbers are formatted with a comma as the decimal separator. As a result, the tests fail with the following log:

Failed tests:   testFloat(com.j256.simplemagic.entries.FormatterTest): expected:<1[.]2> but was:<1[,]2>
  testFloatScientific(com.j256.simplemagic.entries.FormatterTest): expected:<1[.]2E0> but was:<1[,]2E0>
  testFloatMixed(com.j256.simplemagic.entries.FormatterTest): expected:<1[.]2> but was:<1[,]2>
  testFiles(com.j256.simplemagic.ContentInfoUtilTest): bad message for /files/x.nuv expected:<...progressive,aspect:1[.00,fps:29.]97> but was:<...progressive,aspect:1[,00,fps:29,]97>
  testPerformanceRun(com.j256.simplemagic.ContentInfoUtilTest): bad message for /files/x.nuv expected:<...progressive,aspect:1[.00,fps:29.]97> but was:<...progressive,aspect:1[,00,fps:29,]97>

Tests run: 250, Failures: 5, Errors: 0, Skipped: 2

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.280 s
[INFO] Finished at: 2018-02-05T14:54:27Z
[INFO] Final Memory: 25M/265M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project simplemagic: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/simplemagic/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

In short, if you have your system language/maven language/java language in Spanish,

mvn install

command will fail.

My system settings are:

Apache Maven 3.5.0
Maven home: /usr/share/maven
Java version: 1.8.0_151, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: es_ES, platform encoding: UTF-8
OS name: "linux", version: "4.13.0-32-generic", arch: "amd64", family: "unix"
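
The effect behind these failures can be reproduced with the standard library alone. The following is a minimal sketch, not the library's own formatter: the class and method names are hypothetical, and it only shows that `String.format` picks the decimal separator from the locale it is given, and that passing `Locale.ROOT` yields a locale-independent result.

```java
import java.util.Locale;

// Hypothetical demo class; it is not part of simplemagic.
class LocaleFormatSketch {

    // Formats with an explicit locale, so the result does not depend on
    // the JVM default locale (es_ES in the report above).
    static String formatWith(Locale locale, double value) {
        return String.format(locale, "%.1f", value);
    }

    public static void main(String[] args) {
        Locale spanish = Locale.forLanguageTag("es-ES");
        // Spanish uses a comma as the decimal separator.
        System.out.println(formatWith(spanish, 1.2));     // prints 1,2
        // Locale.ROOT always uses a period.
        System.out.println(formatWith(Locale.ROOT, 1.2)); // prints 1.2
    }
}
```

Code that formats numbers for comparison against fixed expected strings is usually safer with an explicit locale than with the default one.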

Can ContentInfoUtil be a singleton? Is this class thread safe?

I would like to know whether the ContentInfoUtil class is thread safe.

If I want to use it, do I have to create a new instance each time?

HTML detection fails with leading newline

Using version 1.8, simplemagic fails to recognize valid HTML input if the input starts with a newline:

@Test
public void testContentInfoUtil() {
    ContentInfoUtil ciu = new ContentInfoUtil();

    String someHtml = "<!doctype html><title>.</title>";
    ContentInfo someHtmlContentInfo = ciu.findMatch( someHtml.getBytes(StandardCharsets.UTF_8) );
    ContentType someHtmlContentType = someHtmlContentInfo.getContentType();
    assertTrue( ContentType.HTML.equals(someHtmlContentType) );
    // passes correctly

    String anotherHtml = "\n<!doctype html><title>.</title>"; // notice the leading newline
    ContentInfo anotherHtmlContentInfo = ciu.findMatch( anotherHtml.getBytes(StandardCharsets.UTF_8) );
    ContentType anotherHtmlContentType = anotherHtmlContentInfo.getContentType(); // java.lang.NullPointerException
    // from here onwards unreachable
    assertTrue( ContentType.HTML.equals(anotherHtmlContentType) );
}

According to https://validator.w3.org, input with leading newlines is still considered valid HTML.
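
Until the detection itself handles this case, a caller-side workaround is to drop leading whitespace before handing the bytes to `findMatch(byte[])`. This is only a sketch under that assumption: `LeadingWhitespaceSketch` and `stripLeadingWhitespace` are hypothetical names, not part of simplemagic, and stripping may be wrong for formats in which leading whitespace is significant.

```java
import java.util.Arrays;

// Hypothetical helper; it is not part of simplemagic.
class LeadingWhitespaceSketch {

    // Returns a copy of the content without leading ASCII whitespace.
    static byte[] stripLeadingWhitespace(byte[] content) {
        int start = 0;
        while (start < content.length && isAsciiWhitespace(content[start])) {
            start++;
        }
        return Arrays.copyOfRange(content, start, content.length);
    }

    private static boolean isAsciiWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r' || b == '\f';
    }
}
```

With the test above, `ciu.findMatch(LeadingWhitespaceSketch.stripLeadingWhitespace(anotherHtml.getBytes(StandardCharsets.UTF_8)))` would receive exactly the same bytes as the first, passing case.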

Info Needed: Why is there a partial match, and what does it do?

Hi Gray,
First of all, thank you for providing this small and to the point library. Amazing.

I have been checking the code, and it looks like the MIME type that is returned could be only a partial match (see MagicEntries.java:140).

Problem: I am using your library to check the MIME type of an image file and then processing it according to its content type (JPEG/PNG). If a partial match causes an image with an unknown extension to be reported as JPEG, PNG, or another common image format, the processing could end up being incorrect.

I have two questions about this:

a. Why is partial matching needed?
b. When a partial match happens, what does ContentType hold: always a single value, or multiple values?

My Guess:
MIME information is stored in a few bytes at the start of the file. You possibly have a trie containing all MIME types. You look up the file's leading bytes in the trie, and if no complete entry is found but the first few bytes match a path in the trie, you return all leaf nodes under that path (the MIME types starting with the same prefix as that of the file).
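
For the JPEG/PNG use case described above, one independent safeguard is to confirm the well-known file signatures before processing, regardless of what the detection returned. The sketch below is not how simplemagic matches files; `ImageSignatureSketch` and its methods are hypothetical names. It relies only on the standard signatures: PNG files start with the eight bytes 89 50 4E 47 0D 0A 1A 0A, and JPEG files start with FF D8 FF.

```java
// Hypothetical helper; it is not part of simplemagic.
class ImageSignatureSketch {

    // Fixed eight-byte PNG signature.
    private static final byte[] PNG =
            { (byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };
    // JPEG start-of-image marker followed by the first byte of the next marker.
    private static final byte[] JPEG = { (byte) 0xFF, (byte) 0xD8, (byte) 0xFF };

    static boolean isPng(byte[] data) {
        return startsWith(data, PNG);
    }

    static boolean isJpeg(byte[] data) {
        return startsWith(data, JPEG);
    }

    // True if data begins with every byte of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A signature check only confirms the container format; it does not prove that the rest of the file is a well-formed image.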

FYI:
The documentation in the GitHub wiki seems to point to URLs that redirect to external websites.
Checked at http://www.redirect-checker.org/index.php

Output:

http://256.com
301 Moved Permanently
http://www.jinpai.com/Home/OnlineBid/buyout_details/mid/24379.html
200 OK

7zip has null mime-type

RegexType reads a line for every byte in mutableOffset.offset

I wanted to use my local machine’s /usr/share/file/magic/kml to detect KML and KMZ files, with this code:

ContentInfoUtil matcher =
    new ContentInfoUtil(new File("/usr/share/file/magic/kml"));

ContentInfo info = matcher.findMatch(new File(kmlFile));

But it always fails, because this line in the magic file never matches:

>>&0 regex ['"]http://earth.google.com/kml Google KML document

It appears this is because RegexType is reading an entire line for every byte in the mutableOffset, causing the matching content to be skipped entirely. In other words, if mutableOffset.offset is ten, the code reads ten lines, rather than limiting its scope to ten bytes.

I found I was able to get KML files to be correctly detected by changing these lines in RegexType.java from this:

if (i < mutableOffset.offset) {
    bytesOffset += line.length() + 1;
}

to this:

if (i < mutableOffset.offset) {
    bytesOffset += line.length() + 1;
    i += line.length();
}
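
The difference between the two interpretations of the offset can be shown in isolation. The following sketch is not the code in RegexType.java; the class and method names are hypothetical, and it assumes single-byte characters and `\n` line endings. One method treats the offset as a number of lines to skip, the other as a number of bytes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; it is not part of simplemagic.
class OffsetSkipSketch {

    // Treats the offset as a line count: skips that many whole lines
    // and returns the next one, or null if the input runs out first.
    static String lineAfterSkippingLines(String text, int offset) {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        try {
            String line = null;
            for (int i = 0; i <= offset; i++) {
                line = reader.readLine();
                if (line == null) {
                    return null;
                }
            }
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Treats the offset as a byte count: returns the line that contains
    // the byte at that offset, or null if the offset is past the end.
    static String lineAtByteOffset(String text, int offset) {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        try {
            int consumed = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                int lineBytes = line.length() + 1; // +1 for the newline
                if (consumed + lineBytes > offset) {
                    return line;
                }
                consumed += lineBytes;
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a three-line KML document and an offset of 22, the line-count version runs past the end of the input and finds nothing, while the byte-count version lands on the second line, where the `<kml` element would be.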

Escapes are not processed in regexes

For example

0	regex		\\(;.*GM\\[[0-9]{1,2}\\]	Smart Game Format

This entry from the standard magic file is rejected because Pattern throws a syntax-error exception. Pattern says that there should be a closing ) at the end of the regex.

What should happen is that the regex should be processed to handle magic escapes before it is passed to Pattern. This would turn the regex into \(;.*GM\[[0-9]{1,2}\] ... which is a valid Pattern string.
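
A minimal sketch of that preprocessing step follows. It is an illustration only, not the library's implementation: `MagicRegexSketch` and its methods are hypothetical names, and it handles just the doubled backslash seen in this entry, whereas a full solution would need to cover the other magic(5) escapes as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper; it is not part of simplemagic.
class MagicRegexSketch {

    // Collapses each doubled backslash from the magic file into a single
    // backslash. String.replace performs a literal (non-regex) replacement.
    static String unescapeBackslashes(String magicRegex) {
        return magicRegex.replace("\\\\", "\\");
    }

    // Unescapes the magic-file text first, then compiles it.
    static Pattern compileMagicRegex(String magicRegex) {
        return Pattern.compile(unescapeBackslashes(magicRegex));
    }
}
```

Applied to the entry above, the text handed to Pattern becomes \(;.*GM\[[0-9]{1,2}\], in which the parenthesis and brackets are escaped literals rather than an unclosed group.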

CMYK Jpeg files being incorrectly read with mime-type octet-stream

I tried reading the MIME type of a JPEG image that uses a CMYK profile. The expected MIME type was image/jpeg, but the library returned application/octet-stream.

Sample Image: CMYK JPEG Image

Expected output: image/jpeg
Actual output: application/octet-stream

Problem with detecting Illustrator file

This Illustrator file (Oxycodone_PK.zip) is not being detected as such.

I'm not quite sure how this file got into this state. I'm guessing that it started life as a PDF file, then got opened in Illustrator, and then saved as an Illustrator file. So this is kind of a corner case...

The property you're keying off of (dc:format) is there, and has the correct value. It just seems to have been moved around a bit.