h2non / filetype.py

Small, dependency-free, fast Python package to infer binary file types by checking their magic-number signatures

Home Page: https://h2non.github.io/filetype.py

License: MIT License

Makefile 1.67% Python 98.33%
magic-numbers filetype python mime extension type inference

filetype.py's Introduction

filetype.py

Small, dependency-free Python package that infers file type and MIME type by checking the magic-number signature of a file or buffer.

This is a Python port of the filetype Go package.

Features

  • Simple and friendly API
  • Supports a wide range of file types
  • Provides file extension and MIME type inference
  • File discovery by extension or MIME type
  • File discovery by kind (image, video, audio…)
  • Pluggable: add new custom type matchers
  • Fast, even when processing large files
  • Only the first 261 bytes, the maximum file header size, are required, so you can pass just those bytes as a buffer
  • Dependency free (just Python code, no C extensions, no libmagic bindings)
  • Cross-platform file recognition
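
To illustrate what checking a magic-number signature means, here is a minimal standard-library sketch. It is not the package's implementation: sniff, sniff_file, HEADER_SIZE and the four-entry SIGNATURES table are hypothetical names used only for this example, and the table covers just a few well-known signatures.

```python
# Hypothetical illustration; not filetype's actual matcher code.
HEADER_SIZE = 261  # header length quoted in the feature list above

# (magic prefix, MIME type) pairs for a few well-known formats.
SIGNATURES = [
    (b'\xff\xd8\xff', 'image/jpeg'),
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'GIF8', 'image/gif'),
    (b'%PDF', 'application/pdf'),
]


def sniff(buf):
    """Return a MIME type for the buffer's leading bytes, or None."""
    header = bytes(buf[:HEADER_SIZE])
    for magic, mime in SIGNATURES:
        if header.startswith(magic):
            return mime
    return None


def sniff_file(path):
    """Read only the header of a file on disk and sniff it."""
    with open(path, 'rb') as handle:
        return sniff(handle.read(HEADER_SIZE))
```

Because only the header is inspected, the cost of a check does not grow with the size of the file.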

Installation

pip install filetype

API

See annotated API reference.

Examples

Simple file type checking

Supported types

Image

  • dwg - image/vnd.dwg
  • xcf - image/x-xcf
  • jpg - image/jpeg
  • jpx - image/jpx
  • png - image/png
  • apng - image/apng
  • gif - image/gif
  • webp - image/webp
  • cr2 - image/x-canon-cr2
  • tif - image/tiff
  • bmp - image/bmp
  • jxr - image/vnd.ms-photo
  • psd - image/vnd.adobe.photoshop
  • ico - image/x-icon
  • heic - image/heic
  • avif - image/avif
  • qoi - image/qoi

Video

  • 3gp - video/3gpp
  • mp4 - video/mp4
  • m4v - video/x-m4v
  • mkv - video/x-matroska
  • webm - video/webm
  • mov - video/quicktime
  • avi - video/x-msvideo
  • wmv - video/x-ms-wmv
  • mpg - video/mpeg
  • flv - video/x-flv

Audio

  • aac - audio/aac
  • mid - audio/midi
  • mp3 - audio/mpeg
  • m4a - audio/mp4
  • ogg - audio/ogg
  • flac - audio/x-flac
  • wav - audio/x-wav
  • amr - audio/amr
  • aiff - audio/x-aiff

Archive

  • br - application/x-brotli
  • rpm - application/x-rpm
  • dcm - application/dicom
  • epub - application/epub+zip
  • zip - application/zip
  • tar - application/x-tar
  • rar - application/x-rar-compressed
  • gz - application/gzip
  • bz2 - application/x-bzip2
  • 7z - application/x-7z-compressed
  • xz - application/x-xz
  • pdf - application/pdf
  • exe - application/x-msdownload
  • swf - application/x-shockwave-flash
  • rtf - application/rtf
  • eot - application/octet-stream
  • ps - application/postscript
  • sqlite - application/x-sqlite3
  • nes - application/x-nintendo-nes-rom
  • crx - application/x-google-chrome-extension
  • cab - application/vnd.ms-cab-compressed
  • deb - application/x-deb
  • ar - application/x-unix-archive
  • Z - application/x-compress
  • lzo - application/x-lzop
  • lz - application/x-lzip
  • lz4 - application/x-lz4
  • zstd - application/zstd

Document

  • doc - application/msword
  • docx - application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • odt - application/vnd.oasis.opendocument.text
  • xls - application/vnd.ms-excel
  • xlsx - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • ods - application/vnd.oasis.opendocument.spreadsheet
  • ppt - application/vnd.ms-powerpoint
  • pptx - application/vnd.openxmlformats-officedocument.presentationml.presentation
  • odp - application/vnd.oasis.opendocument.presentation

Font

  • woff - application/font-woff
  • woff2 - application/font-woff
  • ttf - application/font-sfnt
  • otf - application/font-sfnt

Application

  • wasm - application/wasm

filetype.py's People

Contributors

0xflotus, alejandrogallo, aluriak, andersk, anemele, babenek, blueyed, catkasha, danielswain, eribertomota, ferstar, gaul, geofmureithi, gforcada, h2non, hannesbraun, johnthagen, jorjmckie, jqqqqqqqqqq, levrik, liam-middlebrook, ltrojan, magbyr, mgorny, mikhailmurashov, pkravetskiy, rsabet, sayanarijit, vuolter, yanhuihang

filetype.py's Issues

pip could not install

λ pip install filetype
Collecting filetype
  Using cached filetype-0.1.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "D:\LOCAL_TEMP\pip-build-605g52_c\filetype\setup.py", line 16, in <module>
        with open(path.join(here, 'README.md'), encoding='utf-8') as f:
      File "C:\Anaconda3\lib\codecs.py", line 895, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: 'D:\\LOCAL_TEMP\\pip-build-605g52_c\\filetype\\README.md'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in D:\LOCAL_TEMP\pip-build-605g52_c\filetype\

Installing from source worked; installing with pip did not.

add_type always fails

If you attempt to call add_type with a subclass of Type, you always get the error "instance must inherit from filetype.types.Type". This appears to be because isinstance only returns true for actual instances and not for subclasses. You need to use issubclass to check for subclasses. I am using Python 3.8, so this may be new behavior.

If I fix this error, I get a further buffer error. Attached is a zip file with my example code. You will need to change the file locations.
detect_file_type.zip
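
The distinction the report draws between isinstance and issubclass can be reproduced without filetype. The sketch below uses a stand-in Type class and a hypothetical describe helper; neither is the library's code, and the report does not say whether add_type is meant to receive a class or an instance.

```python
class Type(object):
    """Stand-in for a base matcher class (hypothetical, for illustration)."""


class Custom(Type):
    """A user-defined subclass."""


def describe(candidate):
    """Say whether candidate is an instance of Type, a subclass, or neither."""
    if isinstance(candidate, Type):
        return 'instance'
    # issubclass raises TypeError for non-classes, so check for a class first.
    if isinstance(candidate, type) and issubclass(candidate, Type):
        return 'subclass'
    return 'neither'
```

Passing Custom (the class itself) yields 'subclass', while passing Custom() yields 'instance', so a check written with isinstance alone rejects the class object.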

Add more file types: doc,docx,xls,xlsx and open office

Hello,

In the description you say "Pluggable: add new custom type matchers". By the way, the links don't work.

Do I need to modify your code, or is there a function I can call? I don't see any example.

docx is recognized as zip (which it is). I suppose I need a second step that extracts the archive and checks again?

Price-matching other repos

Add Support for PathLike objects

Python 3.6 introduced the file system path protocol with PEP 519. All python (3.6+) builtins, and most (if not all) standard library modules accept a path-like object where only string or bytes were previously accepted. This is especially useful when using os.path alternatives like pathlib.

Any class can include this protocol by inheriting from os.PathLike and implementing a concrete definition of __fspath__ that returns either a str or bytes object.

The area most suitable for providing this support seems to be utils.get_bytes. While checking isinstance(obj, os.PathLike) makes use of the formally defined interface, it's only compatible with versions 3.6+. To reconcile this, the following idiom recommended by PEP 519 offers compatibility with previous versions:

obj = obj.__fspath__() if hasattr(obj, '__fspath__') else obj

If a more explicit implementation is desired, a Python 2+3 compatible PathLike interface matching the signature provided by os.PathLike can be used as well.

import abc

# provides the features of the abc.ABC helper class introduced in 3.4
ABC = abc.ABCMeta('ABC', (object,), {'__slots__': ()}) 

class PathLike(ABC):
    """Abstract base class for implementing the file system path protocol."""

    @abc.abstractmethod
    def __fspath__(self):
        """Return the file system path representation of the object."""
        raise NotImplementedError

    @classmethod
    def __subclasshook__(cls, subclass):
        if cls is not PathLike:        
            return NotImplemented
        for parent in subclass.__mro__:            
            attrs = parent.__dict__
            if '__fspath__' in attrs:
                return NotImplemented if attrs['__fspath__'] is None else True
        return NotImplemented

Regardless of the implementation, I feel including PathLike support offers value without adding dependencies or introducing compatibility problems with older versions of Python.
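
As a rough sketch of how the hasattr idiom could sit in front of a header read, assuming Python 3: to_path_or_same and read_header are hypothetical helpers written for this example, not the existing utils.get_bytes, and the default of 262 bytes is only an illustrative choice.

```python
import pathlib


def to_path_or_same(obj):
    """Return obj.__fspath__() for path-like objects, else obj unchanged."""
    # Same hasattr-based idiom as quoted above; os.PathLike is not required.
    return obj.__fspath__() if hasattr(obj, '__fspath__') else obj


def read_header(obj, size=262):
    """Return the first size bytes of a path, path-like object, or buffer."""
    obj = to_path_or_same(obj)
    if isinstance(obj, str):
        # A string is treated as a file name on disk.
        with open(obj, 'rb') as handle:
            return handle.read(size)
    # Anything else is assumed to be a bytes-like buffer.
    return bytes(obj[:size])
```

With this in place, read_header(pathlib.Path('sample.jpg')) and read_header('sample.jpg') take the same code path.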

support svg

filetype doesn't support the SVG format.
What can I do?
Many images are in SVG format now.
You have a Go version, but how can I use it from Python?

None returned for plain text files

I've read #30, but I find that having to do something like this (pardon the comments, direct copy-paste):

    def is_mimetype_family(self, want_family):
        our_type = filetype.guess(self.path)

        if our_type is None:
            # sometimes, filetype fails horribly
            # # like with text files. works great for images though
            type, encoding = mimetypes.guess_type(self.path)
            if type is None:
                return False
            if re.match(want_family, type) is not None:
                return True
            return False
        if re.match(want_family, our_type.mime) is not None:
            return True
        return False

It's a little counter-intuitive. Perhaps, you could utilize mimetypes as I have done.

I expected filetype to be a "one-stop shop" for mimetypes. Nevertheless, it is an excellent library for detecting mimetypes.
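
The fallback described in this report can be separated from the detection step. The following is a minimal standard-library sketch; mime_with_fallback and in_family are hypothetical names, and sniffed_mime stands for whatever a content-based detector returned (None when it found nothing).

```python
import mimetypes


def mime_with_fallback(path, sniffed_mime=None):
    """Prefer a content-based MIME type; fall back to the file extension."""
    if sniffed_mime is not None:
        return sniffed_mime
    # guess_type looks only at the file name; it never opens the file.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed


def in_family(mime, family):
    """True when mime is, say, 'text/plain' and family is 'text'."""
    return mime is not None and mime.split('/', 1)[0] == family
```

An extension-based guess is weaker than a signature match, since a file name can be wrong, which is one reason to keep the two sources distinct.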

Licensing for tests/fixtures/*

Hi @h2non,

I am trying to package filetype.py for Debian. However, the tests/fixtures/* files do not seem to have been originally created by you, and the Debian FTP Masters rejected the package. For example, sample.jpg contains:

Copyright (c) 1998 Hewlett-Packard Company

Please, can you clarify?

I suggest you generate all files and put a specific notice about it.

Regards,

Eriberto

Not able to identify "tar" package

I created a simple tar package "dummy.tar" using "tar -cvf" and filetype always returns None.

I can see that tar is supported but not working for me
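
For context, POSIX-style tar archives carry the magic string ustar at byte offset 257 of the first header block, so a detector has to read slightly past 257 bytes to recognise them. The sketch below builds a tiny archive in memory with the standard tarfile module and checks that offset. make_tar_bytes and has_ustar_magic are hypothetical helpers, not filetype functions, and this does not diagnose the reporter's file; archives in the older pre-POSIX layout carry no such magic.

```python
import io
import tarfile


def make_tar_bytes():
    """Build a one-member tar archive in memory and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        data = b'hello'
        info = tarfile.TarInfo(name='dummy.txt')
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def has_ustar_magic(buf):
    """Check for the 'ustar' magic at offset 257 of the first header block."""
    return bytes(buf[257:262]) == b'ustar'
```

Running has_ustar_magic on the first 262 bytes of dummy.tar would show whether the archive carries the magic at all.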

get_type('text/plain')

I can understand that detecting text/plain is hard.

But it would be great if I could guess the file extension if the mime-type is known.

Please make get_type('text/plain') work.

Thank you

AttributeError: 'function' object has no attribute 'archive'

FYI

import filetype
import gzip

>>> filetype.helpers.is_archive(gzip.compress(b'test'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/filetype/helpers.py", line 74, in is_archive
    return match.archive(obj) is not None
AttributeError: 'function' object has no attribute 'archive'

Seems there is a naming conflict when using helpers directly with the filetype.match.match and filetype.match.archive

Going around the helpers works though:

import gzip
from filetype.match import archive

>>> bool(archive(gzip.compress(b'test')))
True

Ebook support

Not every ebook is Epub, so these might be good to note.

  • MOBI
  • DJVU
  • AZW and AZW3
  • FB2

please add a image type named dcm

There is an image type named dcm (DICOM) that is widely used in radiation medicine. CT and other radiological data are usually recorded in it. I think filetype would be better if dcm were added, because many researchers now work with this kind of image.

Check file type from request data

I am trying to get the file type from request data and then save the file, but the saved file is unusable if I call filetype.guess( audio_file ) first.

I also tried saving the data first, reading the saved file, and saving it again after checking the file type; the saved file is unusable too.

How can I solve this problem?

  • get the file type from request data and save it
audio_file = request.files['data']
kind = filetype.guess( audio_file )

if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], 'audio.'+file_type )
  • save the data first, read it and save again
audio_file = request.files['data']
tmp_path = os.path.join( current_app.config['UPLOAD_FOLDER'], "upload_audio.tmp" )
audio_file.save( tmp_path )

tmp_file = open( tmp_path , 'rb' )
tmp_data = tmp_file.read()
kind = filetype.guess( tmp_data )
tmp_file.close()

if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], wav_id+'.'+file_type )
    wav_file = open( wav_path , 'wb' )
    wav_file.write( tmp_data )
    wav_file.close()

Incorrect handling of CR2 files

Hello. I have a problem when processing CR2 files: filetype recognizes them as both the tiff and the cr2 type. That's not surprising, since CR2 is based on TIFF.

Filetype version 1.0.7
Sample code:

from filetype.types.image import Tiff, Cr2
from filetype import match
match("Path to cr2 file", matchers=[Cr2()])
match("Path to cr2 file", matchers=[Tiff()])

Result is:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])
<filetype.types.image.Tiff object at 0x0000029F0EF2CA20>

Should be:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])

You can get a sample CR2 file here.
I think that to solve this problem we need to add something like and not (buf[8] == 0x43 and buf[9] == 0x52)
here, to make sure that there is no CR2 magic word in the buffer.
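
The proposed exclusion can be sketched as two standalone predicates. These are hypothetical functions written for illustration, not the library's Tiff and Cr2 matchers; they assume only what the report states (a TIFF header, with the bytes 0x43 0x52, "CR", at offsets 8 and 9 for CR2).

```python
TIFF_MAGICS = (b'II*\x00', b'MM\x00*')  # little- and big-endian TIFF headers


def is_cr2(buf):
    """TIFF header followed by the 'CR' marker at bytes 8-9."""
    return (len(buf) > 9
            and bytes(buf[:4]) in TIFF_MAGICS
            and bytes(buf[8:10]) == b'CR')


def is_plain_tiff(buf):
    """TIFF header that does not also carry the CR2 marker."""
    return (len(buf) > 3
            and bytes(buf[:4]) in TIFF_MAGICS
            and not is_cr2(buf))
```

With the extra condition, a CR2 header satisfies only is_cr2, which is the behaviour shown under "Should be" above.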

filetype should add filetype.BYTES_MINIMUM

From https://github.com/h2non/filetype.py/blob/v1.0.5/filetype/utils.py#L3-L18

_NUM_SIGNATURE_BYTES = 262

_NUM_SIGNATURE_BYTES bytes are read from the passed data to determine the signature.
_NUM_SIGNATURE_BYTES could be considered the recommended number of bytes for best signature matching. Some users of filetype may want to know the minimum number of bytes needed before calling any filetype API.
However, the variable is somewhat obscured: filetype.utils._NUM_SIGNATURE_BYTES is an awkward reference, and the leading _ suggests a "private" variable.

_NUM_SIGNATURE_BYTES should be exposed at the root of the filetype package as a Python "constant", e.g. filetype.BYTES_MINIMUM or filetype.BYTES_SUGGESTED.

Stabilise API

Release 1.0.7 broke the specialised matchers that are still documented here https://h2non.github.io/filetype.py/v1.0.0/match.m.html

One could make the argument that these functions are internal API since they're not officially documented in the examples, so it's ok to break them without even a minor version bump.

However, given the usefulness of these functions (e.g. for scenarios in which one only looks for images -- something often encountered in web development) please expose them officially in the examples, and keep them stable.

AttributeError while guessing image's kind

>>> import filetype
>>> import requests
>>> url = "https://45.img.avito.st/image/1/lCc6lra4OM5MM8rIHszqVLE1PsiYNTjI_1Y-woo1OM7K"
>>> r = requests.get(url)
>>> filetype.guess(r.content[:10])
<filetype.types.image.Jpeg object at 0x10b209f70>
>>> filetype.guess(r.content[:10]).kind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Jpeg' object has no attribute 'kind'

From the initial response it would appear that the kind has been detected but once we call the method we have an AttributeError.

Python 3.8.0

Avi.match does not check byte 12 of file header

The last check in Avi.match is buf[10] == 0x49.

As far as I understand, the first four bytes are the RIFF signature (\x52\x49\x46\x46), followed by four bytes giving the file size, followed by four bytes identifying the file type, which would be \x41\x56\x49\x20 in the case of an AVI.

Does the method lack a buf[11] == 0x20 check?
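
A check that follows the layout described above compares the whole four-byte form type rather than individual bytes. This is a minimal sketch with a hypothetical is_avi function, not the library's Avi.match.

```python
def is_avi(buf):
    """RIFF container whose form type (bytes 8-11) is 'AVI '."""
    return (len(buf) >= 12
            and bytes(buf[0:4]) == b'RIFF'     # \x52\x49\x46\x46
            and bytes(buf[8:12]) == b'AVI ')   # \x41\x56\x49\x20
```

Comparing bytes 8 to 11 in one slice covers the trailing 0x20 that the question asks about.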

File does not evaluate mp4 properly.

This signature returns "None" filetype:

>>> kind = filetype.utils.get_signature_bytes('HDVWM419.mp4')
>>> print(kind)
bytearray(b"\x00\x00\x00 ftypisom\x00\x00\x02\x00isomiso2avc1mp41\x00\x00\x00\x08free\x8b_\x04\xf6mdat\x00\x01\'xe\xb8\x04_\xdb\xb3`+@\x85a)>\x10\x88\xebv{\x1ec(\x96(\xc9W^\x8a\\\xa7\xaaB$\xfaz\xb1\x8c&\xca\t\xa9\x04\x95\xb7\x87P\xf2\xaew~,\x8f\xa2\xda>\xe6\xe4\n$&\x9dm\x19\xeb4\xbd0\x00\xc6\x91\xf0\xb0\x85\x0f\xab<\x04\xf5\xe00\xaa\rm\xdc\xa6<a\x08\xcf\x8c\\\x0f\x18)\xdd\xc7\x8e\n\xd6\xd7\xd7\x05\x0fdPj\x15\x1f\xc5H\xd4\x98\x0cx\xce\xb9\xa7\xa8\t\xea\x8d\xe1\xb7\xe2F\x8fQoD\xadKT{\xc9D\xcapZ\xb8\xa2\xeez\xbd\xab\x9e7\x9a\xf7G\xbe/\xbdQ>P\xf6\xa3f\xdc\x17\xfb\xcb\x9c\x9a\x14\x06\xd4J\xb2\xe2\x15\x05\xda\xc5oL\x0b\xbd!\xb7>-\xe2\xb6\xda\x8bi\xab\x8c\xe3\xc1\xa7\x82c\x83\x93\x17$\xd9\xa8zM\xe4@Q\xab\\\xc5\xb4<\x04")

file HDVWM419.mp4
HDVWM419.mp4: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]


    Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, GBR), 1920x1080 [SAR 1:1 DAR 16:9], 12168 kb/s, 30 fps, 30 tbr, 30 tbn, 60

I guess the issue is about using the correct magic file or metadata evaluation. I looked at your source code, but I'm not sure how to get it to see this as an MP4.

Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    creation_time   : 2016-09-19T13:43:30.000000Z
    encoder         : Lavf51.12.1

Videos with metadata that matches:
  Metadata:
    major_brand     : mp42
    minor_version   : 1
    compatible_brands: mp42mp41

detecting mp4 video

I have an mp4 video that returns on a call to _get_ftyp() like so:

('isom', 1, ['isom', 'avc1', 'mp42'])

Should the matching be more lenient for 'compatible brands'? I'm asking because I don't know what isom is, and it's unclear what the intention is with parsing out compatible brands.
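
For readers unfamiliar with the ftyp box: an ISO base media file starts with a 4-byte big-endian box size, the 4-byte type ftyp, a 4-byte major brand, a 4-byte minor version, and then a list of 4-byte compatible brands. The sketch below parses that layout. parse_ftyp and has_brand are hypothetical helpers, not the library's _get_ftyp, and which brands should count as MP4 is left to the caller.

```python
import struct


def parse_ftyp(buf):
    """Return (major_brand, minor_version, compatible_brands) or None."""
    if len(buf) < 16 or bytes(buf[4:8]) != b'ftyp':
        return None
    size = struct.unpack('>I', bytes(buf[0:4]))[0]
    end = min(size, len(buf))  # never read past the bytes we actually have
    major = bytes(buf[8:12]).decode('ascii', 'replace')
    minor = struct.unpack('>I', bytes(buf[12:16]))[0]
    compatible = [
        bytes(buf[i:i + 4]).decode('ascii', 'replace')
        for i in range(16, end - 3, 4)
    ]
    return major, minor, compatible


def has_brand(buf, wanted):
    """True if the major brand or any compatible brand is in wanted."""
    parsed = parse_ftyp(buf)
    if parsed is None:
        return False
    major, _minor, compatible = parsed
    return major in wanted or any(brand in wanted for brand in compatible)
```

A lenient matcher in this style would accept a file whose major brand is isom as long as one compatible brand is in the accepted set.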

API design

Based on Go implementation with Python idioms

conda install doesn't recognize /opt/conda/bin/python3.6

> ls -lah /opt/conda/bin/python3.6
-rwxrwxr-x 1 root root 3.6M Jun  8  2018 /opt/conda/bin/python3.6

>conda install filetype
...

The following NEW packages will be INSTALLED:
...
    filetype:                   1.0.7-pyh9f0ad1d_0         conda-forge

...


conda install filetype
ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::filetype-1.0.7-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")


Update PyPI

It seems the latest version has still not been published on PyPI.
Maybe it's time to create a GitHub workflow for automatic uploads?

1.0.7 release fixes

The 1.0.7 release tarball is missing the sample.tar file used in test_infer_zip_from_disk and test_infer_tar_from_disk.
Also, the History.md contents end at version 1.0.5.

Why the original file has to be broken in get_bytes(obj)?

Here → obj = obj.read(_NUM_SIGNATURE_BYTES)

If I check my object's type with something like if filetype.guess_mime(file) != "image/jpeg":, the file itself is broken afterwards and I can't use it later.

I wonder whether this is intentional.

Thank you!
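
A likely explanation is that a file object's read position advances as bytes are consumed, so a later read or save starts where the header read stopped. A caller who needs the whole stream afterwards can rewind it. A minimal sketch, assuming a seekable binary stream; peek_header is a hypothetical helper, not part of filetype, and the default of 262 simply mirrors the _NUM_SIGNATURE_BYTES value quoted elsewhere on this page.

```python
import io


def peek_header(stream, size=262):
    """Read up to size bytes from a seekable binary stream, then rewind."""
    start = stream.tell()
    try:
        return stream.read(size)
    finally:
        # Restore the position so later reads or saves see the whole file.
        stream.seek(start)
```

The returned header bytes can then be handed to a detector while the original stream stays usable.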

pip install filetype doesn't work: ImportError: No module named filetype

Your code is great and it's helping me a lot.
BTW, I have to report a problem: pip install filetype doesn't work, and your test code keeps failing with:

Traceback (most recent call last):
  File "TestFiletype.py", line 4, in <module>
    import filetype
ImportError: No module named filetype

even though filetype is correctly installed.
So I downloaded your setup.py file, but it would not install because 'README.rst' was missing.
...really, do I need 'README.rst' in order to install your package?
So I downloaded the complete .zip, and filetype installed correctly; now everything works.

Being able to install filetype through pip install filetype would be amazing.

Thank you, and keep up the good code.
