h2non / filetype.py

Small, dependency-free, fast Python package to infer binary file types checking the magic numbers signature

Home Page: https://h2non.github.io/filetype.py

License: MIT License

Makefile 1.67% Python 98.33%

magic-numbers filetype python mime extension type inference

filetype.py's Introduction

filetype.py

Small, dependency-free Python package to infer file type and MIME type by checking the magic number signature of a file or buffer.

This is a Python port of the filetype Go package.

Features

Simple and friendly API
Supports a wide range of file types
Provides file extension and MIME type inference
File discovery by extension or MIME type
File discovery by kind (image, video, audio…)
Pluggable: add new custom type matchers
Fast, even when processing large files
Only the first 261 bytes, representing the maximum file header, are required, so you can just pass a list of bytes
Dependency free (just Python code, no C extensions, no libmagic bindings)
Cross-platform file recognition

Installation

pip install filetype

API

See annotated API reference.

Examples

Simple file type checking

import filetype

def main():
    kind = filetype.guess('tests/fixtures/sample.jpg')
    if kind is None:
        print('Cannot guess file type!')
        return

    print('File extension: %s' % kind.extension)
    print('File MIME type: %s' % kind.mime)

if __name__ == '__main__':
    main()

Supported types

Image

dwg - image/vnd.dwg
xcf - image/x-xcf
jpg - image/jpeg
jpx - image/jpx
png - image/png
apng - image/apng
gif - image/gif
webp - image/webp
cr2 - image/x-canon-cr2
tif - image/tiff
bmp - image/bmp
jxr - image/vnd.ms-photo
psd - image/vnd.adobe.photoshop
ico - image/x-icon
heic - image/heic
avif - image/avif
qoi - image/qoi

Video

3gp - video/3gpp
mp4 - video/mp4
m4v - video/x-m4v
mkv - video/x-matroska
webm - video/webm
mov - video/quicktime
avi - video/x-msvideo
wmv - video/x-ms-wmv
mpg - video/mpeg
flv - video/x-flv

Audio

aac - audio/aac
mid - audio/midi
mp3 - audio/mpeg
m4a - audio/mp4
ogg - audio/ogg
flac - audio/x-flac
wav - audio/x-wav
amr - audio/amr
aiff - audio/x-aiff

Document

doc - application/msword
docx - application/vnd.openxmlformats-officedocument.wordprocessingml.document
odt - application/vnd.oasis.opendocument.text
xls - application/vnd.ms-excel
xlsx - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
ods - application/vnd.oasis.opendocument.spreadsheet
ppt - application/vnd.ms-powerpoint
pptx - application/vnd.openxmlformats-officedocument.presentationml.presentation
odp - application/vnd.oasis.opendocument.presentation

Font

woff - application/font-woff
woff2 - application/font-woff
ttf - application/font-sfnt
otf - application/font-sfnt

Application

wasm - application/wasm

filetype.py's Issues

Use a file signatures table to speed up the file type recognition

I think that prebuilding a dict containing all the magic signatures, and using it for the file header lookup, would be more time-efficient than calling each type object in turn to find the matching file header.
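A rough sketch of what such a table could look like, assuming only signatures that start at offset 0. The names _SIGNATURES, _INDEX and lookup are hypothetical and not part of filetype; the four magic numbers are the well-known ones for JPEG, PNG, GIF and PDF.

```python
from collections import defaultdict

# Hypothetical table: (magic bytes at offset 0, extension, MIME type).
_SIGNATURES = [
    (b"\xff\xd8\xff", "jpg", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png", "image/png"),
    (b"GIF8", "gif", "image/gif"),
    (b"%PDF", "pdf", "application/pdf"),
]

# Index built once at import time, keyed by the first byte of each signature,
# so a lookup only compares against signatures sharing that first byte.
_INDEX = defaultdict(list)
for _magic, _ext, _mime in _SIGNATURES:
    _INDEX[_magic[0]].append((_magic, _ext, _mime))


def lookup(buf):
    """Return (extension, mime) for a matching signature, or None."""
    data = bytes(buf[:16])
    if not data:
        return None
    for magic, ext, mime in _INDEX.get(data[0], ()):
        if data.startswith(magic):
            return ext, mime
    return None
```

Formats whose signature sits at a non-zero offset, or that need extra checks, would still need a per-type matcher as a fallback.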

pip could not install

λ pip install filetype
Collecting filetype
  Using cached filetype-0.1.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "D:\LOCAL_TEMP\pip-build-605g52_c\filetype\setup.py", line 16, in <module>
        with open(path.join(here, 'README.md'), encoding='utf-8') as f:
      File "C:\Anaconda3\lib\codecs.py", line 895, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: 'D:\\LOCAL_TEMP\\pip-build-605g52_c\\filetype\\README.md'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in D:\LOCAL_TEMP\pip-build-605g52_c\filetype\

I could install from source, but pip did not work.

If you attempt to do an add_type with a subclass of Type, you always get the "instance must inherit from filetype.types.Type". This appears to be because isinstance only returns true for actual instances and not for subclasses. You need to use issubclass to check for subclasses. I am using Python 3.8, so this may be new behavior.

If I fix this error, I get a further buffer error. Attached is a zip file with my example code. You will need to change the file locations.
detect_file_type.zip

Add more file types: doc,docx,xls,xlsx and open office

Hello,

In the description you say "Pluggable: add new custom type matchers". By the way, the links don't work.

Do I need to modify your code, or is there a function call or something? I don't see any example.

docx is recognized as zip (which it is). I suppose I need a second step to extract the archive and check again?

Add coveralls support

Price-matching other repos

These are Python repos with lots of file signatures that might not have been covered by filetype.py

https://github.com/floyernick/fleep-py/blob/master/fleep/data.json (193 stars)
https://github.com/h2non/filetype.py/tree/master/filetype/types (this repo)
https://github.com/openpreserve/fido/blob/master/fido/conf/format_extensions.xml (79 stars)
https://github.com/cdgriffith/puremagic/blob/master/puremagic/magic_data.json (47 stars)
https://github.com/omriher/Whatype/blob/master/whatype/magics.csv (12 stars)
https://github.com/schlerp/pyfsig/blob/master/src/pyfsig/file_signatures.py (9 stars)
https://github.com/7h3rAm/cigma/blob/master/cigma/magicbytes.json (1 star)

Add Support for PathLike objects

Python 3.6 introduced the file system path protocol with PEP 519. All python (3.6+) builtins, and most (if not all) standard library modules accept a path-like object where only string or bytes were previously accepted. This is especially useful when using os.path alternatives like pathlib.

Any class can include this protocol by inheriting from os.PathLike and implementing a concrete definition of __fspath__ that returns either a str or bytes object.

The area most suitable for providing this support seems to be utils.get_bytes. While checking isinstance(obj, os.PathLike) makes use of the formally defined interface, it's only compatible with versions 3.6+. To reconcile this, the following idiom recommended by PEP 519 offers compatibility with previous versions:

obj = obj.__fspath__() if hasattr(obj, '__fspath__') else obj

If a more explicit implementation is desired, a Python 2+3 compatible PathLike interface matching the signature provided by os.PathLike can be used as well.

import abc

# provides the features of the abc.ABC helper class introduced in 3.4
ABC = abc.ABCMeta('ABC', (object,), {'__slots__': ()}) 

class PathLike(ABC):
    """Abstract base class for implementing the file system path protocol."""

    @abc.abstractmethod
    def __fspath__(self):
        """Return the file system path representation of the object."""
        raise NotImplementedError

    @classmethod
    def __subclasshook__(cls, subclass):
        if cls is not PathLike:        
            return NotImplemented
        for parent in subclass.__mro__:            
            attrs = parent.__dict__
            if '__fspath__' in attrs:
                return NotImplemented if attrs['__fspath__'] is None else True
        return NotImplemented

Regardless of the implementation, I feel including PathLike support offers value without adding dependencies or introducing compatibility problems with older versions of Python.

support svg

It doesn't support the SVG format.
What can I do?
Many files are in SVG format now.
You have a Go version, but how can I use it in Python?

Switch to pytest

I can do a PR.

Full test coverage

Add support for LX4 compression format

It would be great to have the support for the lx4 compression format.
Thanks a lot for making this nice package.

None returned for plain text files

I've read #30, but I find myself having to do something like this (pardon the comments, direct copy-paste):

    def is_mimetype_family(self, want_family):
        our_type = filetype.guess(self.path)

        if our_type is None:
            # sometimes, filetype fails horribly
            # # like with text files. works great for images though
            type, encoding = mimetypes.guess_type(self.path)
            if type is None:
                return False
            if re.match(want_family, type) is not None:
                return True
            return False
        if re.match(want_family, our_type.mime) is not None:
            return True
        return False

It's a little counter-intuitive. Perhaps, you could utilize mimetypes as I have done.

I expected filetype to be a "one-stop shop" for mimetypes. Nevertheless, it is an excellent library for detecting mimetypes.

Licensing for tests/fixtures/*

Hi @h2non,

I am trying to package filetype.py for Debian. However, the tests/fixtures/* files do not seem to have been originally created by you, and the Debian FTP Masters rejected the package. I can see this in sample.jpg:

Please, can you clarify?

I suggest you generate all files and put a specific notice about it.

Regards,

Eriberto

Not able to identify "tar" package

I created a simple tar package "dummy.tar" using "tar -cvf", and filetype always returns None.

I can see that tar is supported, but it is not working for me.

get_type('text/plain')

I can understand that detecting text/plain is hard.

But it would be great if I could guess the file extension if the mime-type is known.

Please make get_type('text/plain') work.

Thank you
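Until something like that exists, the standard library's mimetypes module can map a known MIME type to candidate extensions. A small sketch; extensions_for is a hypothetical helper name, not a filetype function, and the exact list returned depends on the platform's MIME tables.

```python
import mimetypes


def extensions_for(mime):
    """Return every extension the stdlib associates with a MIME type."""
    # guess_all_extensions returns an empty list for unknown types.
    return mimetypes.guess_all_extensions(mime)
```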

get_type uses string object identity instead of equality

Anything that isn't a literal will result in None.

>>> from filetype import get_type
>>> x = '.mp4'
>>> print(get_type(ext=x.replace('.', '')))
None
>>> get_type(ext='mp4')
<filetype.types.video.Mp4 object at 0x7f1f9c3bec90>
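A minimal sketch of the lookup done with equality instead of identity. Kind, KINDS and get_type_by_ext are hypothetical stand-ins used for illustration; they are not the library's actual classes or code.

```python
class Kind:
    """Hypothetical stand-in for a filetype type object."""

    def __init__(self, mime, extension):
        self.mime = mime
        self.extension = extension


KINDS = [Kind("video/mp4", "mp4"), Kind("image/png", "png")]


def get_type_by_ext(ext, kinds=KINDS):
    """Return the first kind whose extension equals ext, or None."""
    for kind in kinds:
        # `==` compares string contents; `is` would compare object identity
        # and can fail for strings that were built at runtime.
        if kind.extension == ext:
            return kind
    return None
```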

AttributeError: 'function' object has no attribute 'archive'

FYI

import filetype
import gzip

>>> filetype.helpers.is_archive(gzip.compress(b'test'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/filetype/helpers.py", line 74, in is_archive
    return match.archive(obj) is not None
AttributeError: 'function' object has no attribute 'archive'

Seems there is a naming conflict when using helpers directly with the filetype.match.match and filetype.match.archive

Going around the helpers works though:

import gzip
from filetype.match import archive

>>> bool(archive(gzip.compress(b'test')))
True

Ebook support

Not every ebook is Epub, so these might be good to note.

MOBI
DJVU
AZW and AZW3
FB2

Use py.test

Cleanup Makefile and use tox

The current makefile destroys my git repo refs on make clean.

I can do a PR

Any way to feed .guess() with bytes instead of a file ?

To be able to check file content in form data receiving from front end.

Feature: Recognize lzo compressed files

It would be great if lzo compressed files would be recognized.

i.e.

00000000  89 4c 5a 4f 00 0d 0a 1a  0a 10 30 20 80 09 40 03  |.LZO......0 ..@.|
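For reference, the dump above starts with the 9-byte lzop magic number. A minimal sketch of a check on a raw buffer; is_lzo is a hypothetical helper, not an existing filetype matcher.

```python
# lzop file magic: 89 4c 5a 4f 00 0d 0a 1a 0a
LZO_MAGIC = b"\x89LZO\x00\r\n\x1a\n"


def is_lzo(buf):
    """Return True if the buffer starts with the lzop magic number."""
    return bytes(buf[:len(LZO_MAGIC)]) == LZO_MAGIC
```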

please add a image type named dcm

There is an image type named dcm, which is commonly used in radiation medicine. CT and other radiological data are usually recorded in it. I think filetype would be better if dcm were added, because many researchers now work with this kind of image.

Check file type from request data

I am trying to get the file type from request data and then save the file, but the saved file is unusable if I call filetype.guess( audio_file ) first.

I also tried saving the data first, reading the saved file back, and saving it again after checking the file type; the saved file is unusable in that case too.

How can I solve this problem?

get the file type from request data and save it

audio_file = request.files['data']
kind = filetype.guess( audio_file )

if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], 'audio.'+file_type )

save the data first, read it and save again

audio_file = request.files['data']
tmp_path = os.path.join( current_app.config['UPLOAD_FOLDER'], "upload_audio.tmp" )
audio_file.save( tmp_path )

tmp_file = open( tmp_path , 'rb' )
tmp_data = tmp_file.read()
kind = filetype.guess( tmp_data )
tmp_file.close()

if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], wav_id+'.'+file_type )
    wav_file = open( wav_path , 'wb' )
    wav_file.write( tmp_data )
    wav_file.close()

Incorrect handling of CR2 files

Hello. I have a problem when trying to process CR2 files. filetype recognizes them as both the tiff and the cr2 type. That is no surprise, since cr2 is based on tiff.

Filetype version 1.0.7
Sample code:

from filetype.types.image import Tiff, Cr2
from filetype import match
match("Path to cr2 file", matchers=[Cr2()])
match("Path to cr2 file", matchers=[Tiff()])

Result is:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])
<filetype.types.image.Tiff object at 0x0000029F0EF2CA20>

Should be:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])

You can take a sample cr2 here.
I think that to solve this problem we need to add something like and not (buf[8] == 0x43 and buf[9] == 0x52)
here, to make sure that there is no Cr2 magic word in the buffer.
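The proposed exclusion could be sketched as follows. This is an illustration written against raw header bytes, not the library's matcher code; is_cr2, is_plain_tiff and TIFF_MAGICS are hypothetical names.

```python
# Little-endian ("II") and big-endian ("MM") TIFF headers.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def is_cr2(buf):
    """TIFF header plus the 'CR' marker (0x43 0x52) at bytes 8-9."""
    data = bytes(buf[:10])
    return len(data) >= 10 and data[:4] in TIFF_MAGICS and data[8:10] == b"CR"


def is_plain_tiff(buf):
    """TIFF header that does NOT carry the CR2 marker."""
    data = bytes(buf[:10])
    return data[:4] in TIFF_MAGICS and not is_cr2(data)
```

With this split, a CR2 header matches only is_cr2, while an ordinary TIFF matches only is_plain_tiff.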

[Feature request] Accept os.PathLike

Currently filetype.guess does not accept PathLike objects (see https://docs.python.org/3/library/os.html#os.PathLike) as used by pathlib.Path. It would be nice if this was possible.
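As a stopgap on Python 3.6+, a caller can convert path-like objects before handing them over. A small sketch using os.fspath; as_filename is a hypothetical helper name, and passing its result on to filetype.guess is left to the caller.

```python
import os


def as_filename(obj):
    """Return str/bytes for path-like objects; pass anything else through."""
    try:
        return os.fspath(obj)
    except TypeError:
        # Not str, bytes or os.PathLike (e.g. bytearray, file object).
        return obj
```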

Is there a way to read more than the first 256 Bytes?

I am trying to pass a long array of bytes to the function but the characters that identify the file go beyond 256 bytes.

filetype should add filetype.BYTES_MINIMUM

From https://github.com/h2non/filetype.py/blob/v1.0.5/filetype/utils.py#L3-L18

_NUM_SIGNATURE_BYTES = 262

_NUM_SIGNATURE_BYTES bytes are read from the passed data to determine the signature.
_NUM_SIGNATURE_BYTES could be considered the recommended number of bytes needed for best signature matching. Some users of filetype may want to know the minimum number of bytes needed before calling any filetype API.
However, the variable is somewhat obscured; filetype.utils._NUM_SIGNATURE_BYTES is an awkward reference and the leading _ suggests a "private" variable.

_NUM_SIGNATURE_BYTES should be exposed at the root of the filetype package as a Python "constant", e.g. filetype.BYTES_MINIMUM or filetype.BYTES_SUGGESTED.

Stabilise API

Release 1.0.7 broke the specialised matchers that are still documented here https://h2non.github.io/filetype.py/v1.0.0/match.m.html

One could make the argument that these functions are internal API since they're not officially documented in the examples, so it's ok to break them without even a minor version bump.

However, given the usefulness of these functions (e.g. for scenarios in which one only looks for images -- something often encountered in web development) please expose them officially in the examples, and keep them stable.

Is there any way to recognize directly whether the file is a pdf file?

Add featured examples

XLS Support

Is it doable to also check for XLS?

Only first 261 bytes representing the max file header is required, so you can just pass a list of bytes

can you link to an example?
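One possible example, as a sketch: read only the header bytes from disk and keep them as a buffer. read_header and HEADER_SIZE are hypothetical names; 261 is the figure quoted in the feature list. The returned bytes can then be passed to filetype.guess, which per the feature list accepts a buffer.

```python
HEADER_SIZE = 261  # figure quoted in the README feature list


def read_header(path, size=HEADER_SIZE):
    """Read at most `size` bytes from the start of a file."""
    with open(path, "rb") as f:
        return f.read(size)
```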

Support SVG images

Hi!

We have a use case to detect images including SVG. This library is perfect except for missing the SVG format. I'm happy to make a PR in the next few days if that's alright.

Edit with sample: https://upload.wikimedia.org/wikipedia/commons/0/02/SVG_logo.svg

AttributeError while guessing image's kind

>>> import filetype
>>> import requests
>>> url = "https://45.img.avito.st/image/1/lCc6lra4OM5MM8rIHszqVLE1PsiYNTjI_1Y-woo1OM7K"
>>> r = requests.get(url)
>>> filetype.guess(r.content[:10])
<filetype.types.image.Jpeg object at 0x10b209f70>
>>> filetype.guess(r.content[:10]).kind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Jpeg' object has no attribute 'kind'

From the initial response it would appear that the kind has been detected but once we call the method we have an AttributeError.

Python 3.8.0

Avi.match does not check byte 12 of file header

The last check in Avi.match is buf[10] == 0x49.

As far as I understand, the first four bytes is the RIFF signature (\x52\x49\x46\x46), followed by four bytes referring to the file size, followed by four bytes identifying the file type, which would be \x41\x56\x49\x20 in the case of an AVI.

Does the method lack a buf[11] == 0x20 check?
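For comparison, a stricter check over the layout described above could look like this. It is a sketch on raw bytes, not the library's Avi.match; is_avi is a hypothetical name.

```python
def is_avi(buf):
    """RIFF signature in bytes 0-3 and the form type 'AVI ' in bytes 8-11."""
    data = bytes(buf[:12])
    # Bytes 4-7 hold the file size and are deliberately ignored.
    return data[:4] == b"RIFF" and data[8:12] == b"AVI "
```

Comparing all four bytes of the form type includes the trailing 0x20 that the question asks about.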

File does not evaluate mp4 properly.

This signature returns "None" filetype:

>>> kind = filetype.utils.get_signature_bytes('HDVWM419.mp4')
>>> print(kind)
bytearray(b"\x00\x00\x00 ftypisom\x00\x00\x02\x00isomiso2avc1mp41\x00\x00\x00\x08free\x8b_\x04\xf6mdat\x00\x01\'xe\xb8\x04_\xdb\xb3`+@\x85a)>\x10\x88\xebv{\x1ec(\x96(\xc9W^\x8a\\\xa7\xaaB$\xfaz\xb1\x8c&\xca\t\xa9\x04\x95\xb7\x87P\xf2\xaew~,\x8f\xa2\xda>\xe6\xe4\n$&\x9dm\x19\xeb4\xbd0\x00\xc6\x91\xf0\xb0\x85\x0f\xab<\x04\xf5\xe00\xaa\rm\xdc\xa6<a\x08\xcf\x8c\\\x0f\x18)\xdd\xc7\x8e\n\xd6\xd7\xd7\x05\x0fdPj\x15\x1f\xc5H\xd4\x98\x0cx\xce\xb9\xa7\xa8\t\xea\x8d\xe1\xb7\xe2F\x8fQoD\xadKT{\xc9D\xcapZ\xb8\xa2\xeez\xbd\xab\x9e7\x9a\xf7G\xbe/\xbdQ>P\xf6\xa3f\xdc\x17\xfb\xcb\x9c\x9a\x14\x06\xd4J\xb2\xe2\x15\x05\xda\xc5oL\x0b\xbd!\xb7>-\xe2\xb6\xda\x8bi\xab\x8c\xe3\xc1\xa7\x82c\x83\x93\x17$\xd9\xa8zM\xe4@Q\xab\\\xc5\xb4<\x04")

file HDVWM419.mp4
HDVWM419.mp4: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]


    Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, GBR), 1920x1080 [SAR 1:1 DAR 16:9], 12168 kb/s, 30 fps, 30 tbr, 30 tbn, 60

I guess the issue is in using the correct magic file or metadata evaluation. I looked at your source code, but I'm not sure how to get it to see this as an MP4.

Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    creation_time   : 2016-09-19T13:43:30.000000Z
    encoder         : Lavf51.12.1

Videos with metadata that matches:
  Metadata:
    major_brand     : mp42
    minor_version   : 1
    compatible_brands: mp42mp41

detecting mp4 video

I have an mp4 video that returns on a call to _get_ftyp() like so:

('isom', 1, ['isom', 'avc1', 'mp42'])

Should the matching be more lenient for 'compatible brands'? I'm asking because I don't know what isom is, and it's unclear what the intention is with parsing out compatible brands.
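To make the tuple above easier to interpret, here is a sketch of how an ISO base media ftyp box breaks down into major brand, minor version and compatible brands. parse_ftyp is a hypothetical function written for illustration; it is not the library's _get_ftyp, and it assumes the ftyp box is the first box in the buffer.

```python
import struct


def parse_ftyp(buf):
    """Return (major_brand, minor_version, compatible_brands) or None."""
    data = bytes(buf)
    if len(data) < 16 or data[4:8] != b"ftyp":
        return None
    (size,) = struct.unpack(">I", data[:4])   # box size, big-endian
    end = min(size, len(data))                # never read past the buffer
    major = data[8:12].decode("ascii", "replace")
    (minor,) = struct.unpack(">I", data[12:16])
    # Compatible brands are 4-byte codes filling the rest of the box.
    brands = [
        data[i:i + 4].decode("ascii", "replace")
        for i in range(16, end - 3, 4)
    ]
    return major, minor, brands
```

A more lenient matcher could then accept a file when either the major brand or any compatible brand is on its list of known MP4 brands; whether that is the intended behaviour is the open question here.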

API design

Based on Go implementation with Python idioms

WebM are not recognized

I have some WebM files and filetype always returns None for them, like this one:
http://video.webmfiles.org/big-buck-bunny_trailer.webm

conda install doesn't recognize /opt/conda/bin/python3.6

> ls -lah /opt/conda/bin/python3.6
-rwxrwxr-x 1 root root 3.6M Jun  8  2018 /opt/conda/bin/python3.6

>conda install filetype
...

The following NEW packages will be INSTALLED:
...
    filetype:                   1.0.7-pyh9f0ad1d_0         conda-forge

...


conda install filetype
ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::filetype-1.0.7-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")

Update PyPI

It seems the latest version is still not updated on PyPI.
Maybe it's time to create a github workflow for automatic upload?

Add support for Brotli compression

Brotli is now a widely supported compression type. We should include that too.

1.0.7 release fixes

The 1.0.7 release tarball is missing the sample.tar file used in test_infer_zip_from_disk and test_infer_tar_from_disk.
Also, the History.md contents end at version 1.0.5.

Why the original file has to be broken in get_bytes(obj)?

Here → obj = obj.read(_NUM_SIGNATURE_BYTES)

If I check my object's type with something like if filetype.guess_mime(file) != "image/jpeg":, the file object itself is left broken (its contents have already been read) and I can't use it later.

I wonder whether this is intentional or not.

Thank you!
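A caller-side workaround, as a sketch: read the header from a seekable stream and then restore the stream position, so the object can still be saved or processed afterwards. peek_header is a hypothetical helper, and 262 mirrors the _NUM_SIGNATURE_BYTES value quoted in another issue.

```python
def peek_header(stream, size=262):
    """Read up to `size` bytes without moving the stream position."""
    pos = stream.tell()
    try:
        return stream.read(size)
    finally:
        # Rewind even if read() raises, so the caller sees an untouched stream.
        stream.seek(pos)
```

The returned bytes can be passed to the guessing function in place of the stream itself. This only works for seekable streams.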

Unable to detect plain text

Filetype.guess on a plain text file always yields 'None'

Specific MP3 file not detected

test.zip
This mp3 is not detected as such.
It has a bit rate of 8 kbps and a sample rate of 24000 Hz. It is basically one second of silence.

PS. The go version has the same problem h2non/filetype#91

Fix pip package installation

pip install filetype doesn't work: ImportError: No module named filetype

Your code is great and it's helping me so much.
BTW, I have to flag a problem: pip install filetype doesn't work, and your test code keeps reporting:

Traceback (most recent call last):
  File "TestFiletype.py", line 4, in <module>
    import filetype
ImportError: No module named filetype

even if filetype is correctly installed:

So I had to download your setup.py file, but it wouldn't install because 'README.rst' was missing:

...really, do I need 'README.rst' in order to install your repository?
So I downloaded the complete .zip, filetype installed correctly, and now everything works.

Being able to install filetype through pip install filetype would be amazing.

Thank you, and keep up the good code.