drewnoakes / metadata-extractor


Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files

License: Apache License 2.0

CSS 0.58% Java 99.41% DIGITAL Command Language 0.01%

java exif iptc xmp metadata icc jpeg webp quicktime mp4

metadata-extractor's Introduction

metadata-extractor is a Java library for reading metadata from media files.

Installation

The easiest way is to install the library via its Maven package.

<dependency>
  <groupId>com.drewnoakes</groupId>
  <artifactId>metadata-extractor</artifactId>
  <version>2.19.0</version>
</dependency>

Alternatively, download it from the releases page.

Usage

Metadata metadata = ImageMetadataReader.readMetadata(imagePath);

With that Metadata instance, you can iterate or query the various tag values that were read from the image.

Features

The library understands several formats of metadata, many of which may be present in a single image:

Exif
IPTC
XMP
JFIF / JFXX
ICC Profiles
Photoshop fields
WebP properties
WAV properties
AVI properties
PNG properties
BMP properties
GIF properties
ICO properties
PCX properties
QuickTime properties
MP4 properties

It will process files of type:

JPEG
TIFF
WebP
WAV
AVI
PSD
PNG
BMP
GIF
HEIF (HEIC & AVIF)
ICO
PCX
QuickTime
MP4
Camera Raw
- NEF (Nikon)
- CR2 (Canon)
- ORF (Olympus)
- ARW (Sony)
- RW2 (Panasonic)
- RWL (Leica)
- SRW (Samsung)

Camera-specific "makernote" data is decoded for cameras manufactured by:

Agfa
Apple
Canon
Casio
Epson
Fujifilm
Kodak
Kyocera
Leica
Minolta
Nikon
Olympus
Panasonic
Pentax
Reconyx
Sanyo
Sigma/Foveon
Sony

Read getting started for an introduction to the basics of using this library.

Questions & Feedback

The quickest way to have your questions answered is via Stack Overflow. Check whether your question has already been asked, and if not, ask a new one tagged with both metadata-extractor and java.

Bugs and feature requests should be provided via the project's issue tracker. Please attach sample images where possible as most issues cannot be investigated without an image.

Contributing

If you want to get your hands dirty, making a pull request is a great way to enhance the library. In general it's best to create an issue first that captures the problem you want to address. You can discuss your proposed solution in that issue. This gives others a chance to provide feedback before you spend your valuable time working on it.

An easier way to help is to contribute to the sample image file library used for research and testing.

Credits

This library is developed by Drew Noakes.

Thanks are due to the many users who sent in suggestions, bug reports, sample images from their cameras as well as encouragement. Wherever possible, they have been credited in the source code and commit logs.

Other languages

.NET metadata-extractor-dotnet is a complete port to C#, maintained alongside this library
PHP php-metadata-extractor wraps this Java project, making it available to users of PHP
Clojure exif-processor wraps this Java project, returning a subset of data

More information about this project is available at:


metadata-extractor's Issues

Don't give access to non-public final references to mutable objects

Example from Directory:

protected final Collection<Tag> _definedTagList = new ArrayList<Tag>();

 public Collection<Tag> getTags()
 {
        return _definedTagList;
 }

Now look at the following code snippet:

 System.out.println(directory.getTagCount());
 directory.getTags().removeAll(directory.getTags());
 System.out.println(directory.getTagCount());

Example output:
21
0

The fix would be to return a copy, return new ArrayList<Tag>(_definedTagList), because clients should not be able to modify _definedTagList: it is part of the implementation rather than the API.

Exposing non-public final references to mutable objects to the public is never a good idea. For an in-depth discussion of this topic I would recommend chapter 4 of "Effective Java" by Joshua Bloch.
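The two usual remedies can be sketched as follows. This is a minimal illustration, not the library's code: TagList is a hypothetical stand-in for Directory, and tags are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Directory; tags are modelled as plain strings.
final class TagList {
    private final List<String> definedTags = new ArrayList<>();

    void addTag(String tag) {
        definedTags.add(tag);
    }

    int getTagCount() {
        return definedTags.size();
    }

    // Option 1: defensive copy. Callers may mutate the result without
    // affecting the internal list.
    Collection<String> getTagsCopy() {
        return new ArrayList<>(definedTags);
    }

    // Option 2: read-only view. Nothing is copied, and any attempt to
    // mutate the returned collection throws UnsupportedOperationException.
    Collection<String> getTagsView() {
        return Collections.unmodifiableCollection(definedTags);
    }
}
```

The copy costs an allocation per call but is a stable snapshot; the view is cheaper but reflects later changes to the underlying list.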

Extract IPTC data from TIFF files

Currently IPTC data is only read from JPEG files.

There was some reference of this issue in Google Code issue 40. There is a sample image attached that should contain IPTC data for processing.

Deploy to Maven Central

Maven is the de facto dependency management system.

Either get builds into Maven Central, or use an alternative repository such as Sonatype.

Some old versions are already in Maven central.

(migrated from Google Code)

Xerces maven dependency required?

It seems the xerces Maven dependency is unused (it is also not mentioned in build.xml) and the build works without it. I would therefore suggest removing this dependency, since it currently causes problems for Apache Tika (https://issues.apache.org/jira/browse/TIKA-1154).

xerces
xercesImpl
2.8.1

Could not determine file's magic number for 0 byte files

If a file is zero bytes we should throw a more specific exception like an EmptyFileException (see also #21).
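As a rough sketch of what such a check might look like, assuming a hypothetical EmptyFileException and a hypothetical helper named MagicNumberProbe (neither is part of the library as described here):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical exception type; the name follows the suggestion in this issue.
final class EmptyFileException extends IOException {
    EmptyFileException(String message) {
        super(message);
    }
}

// Hypothetical helper, for illustration only.
final class MagicNumberProbe {
    // Reads the first two bytes as a big-endian magic number, reporting an
    // empty stream separately from one that is merely too short.
    static int readMagicNumber(InputStream in) throws IOException {
        int first = in.read();
        if (first == -1) {
            throw new EmptyFileException("File is empty (0 bytes)");
        }
        int second = in.read();
        if (second == -1) {
            throw new IOException("File is too short to contain a magic number");
        }
        return (first << 8) | second;
    }
}
```

Making the new exception a subclass of the one already thrown would let existing catch blocks keep working while callers that care can catch the specific type.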

Review whether the bit masks are needed in SequentialReader and RandomAccessReader

The getUIntX and getIntX methods can be written in a more elegant way. If you look closely you will see that even the second version below can be improved further, but I don't have time for that now. I may do it tomorrow or on Monday.

        // u16
        (getByte() & 0xFF) << 8 | getByte() & 0xFF;
        // s16
        getByte() << 8 | getByte() & 0xFF;
        // u32
        (getByte() & 0xFFL) << 24 | (getByte() & 0xFFL) << 16 | (getByte() & 0xFFL) << 8 | getByte() & 0xFFL;
        // s32
        getByte() << 24 | (getByte() & 0xFFL) << 16 | (getByte() & 0xFFL) << 8 | getByte() & 0xFFL;

Or even better:

// Not static: getByte() reads from this reader instance.
private long getUnsignedByte()
{
    return getByte() & 0xFFL;
}

        // u16
        getUnsignedByte() << 8 | getUnsignedByte();
        // s16
        getByte() << 8 | getUnsignedByte();
        // u32
        getUnsignedByte() << 24 | getUnsignedByte() << 16 | getUnsignedByte() << 8 | getUnsignedByte();
        // s32
        getByte() << 24 | getUnsignedByte() << 16 | getUnsignedByte() << 8 | getUnsignedByte();
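To make the role of the masks concrete, here is a small runnable sketch. ByteArrayReader is a hypothetical big-endian reader over a byte array, not the library's SequentialReader or RandomAccessReader. The mask matters because Java's byte is signed: without it, a byte such as 0xFE is sign-extended to -2 before being combined with the other bytes.

```java
// Hypothetical minimal big-endian reader over a byte array.
final class ByteArrayReader {
    private final byte[] data;
    private int position;

    ByteArrayReader(byte[] data) {
        this.data = data;
    }

    byte getByte() {
        return data[position++];
    }

    // Masking with a long literal both removes the sign extension and
    // widens to long, so the 32-bit cases cannot overflow an int.
    private long getUnsignedByte() {
        return getByte() & 0xFFL;
    }

    int getUInt16() {
        return (int) (getUnsignedByte() << 8 | getUnsignedByte());
    }

    short getInt16() {
        // The high byte keeps its sign; only the low byte is masked.
        return (short) (getByte() << 8 | getUnsignedByte());
    }

    long getUInt32() {
        return getUnsignedByte() << 24 | getUnsignedByte() << 16
             | getUnsignedByte() << 8 | getUnsignedByte();
    }

    int getInt32() {
        // Narrowing the unsigned 32-bit value reinterprets it as signed.
        return (int) (getUnsignedByte() << 24 | getUnsignedByte() << 16
                    | getUnsignedByte() << 8 | getUnsignedByte());
    }
}
```

Java evaluates the operands of | from left to right, so the bytes are consumed in stream order.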

Refactor duplicated TIFF processing code

There's some very similar code in ExifReader and TiffReader. Essentially these are the same format, so refactor this out.

Unable to read GIF files

I have been testing out metadata-extractor for the past few hours. I have been able to get it to read all file types except for GIF. As soon as it hits that type of file, I get the following stack trace:

Caused by: com.drew.imaging.ImageProcessingException: File format is not supported
at com.drew.imaging.ImageMetadataReader.readMetadata(Unknown Source)
at com.drew.imaging.ImageMetadataReader.readMetadata(Unknown Source)
at Find$Finder.find(Find.java:81)
... 10 more

I am using version 2.6.4. Is this a known issue?

Obtain sample images from popular cameras

It's been a while since the image database had many photos added, and there are a lot of new and popular cameras out there:

https://www.flickr.com/cameras

Find a way to get sample images for these models and verify they're processed correctly.

JPEG segment check failing on valid images

It seems there are quite a few images that seem valid failing on this check.

I have read here that the reasoning is that it isn't valid JPEG data - but if it is happening repeatedly, surely it must be?

When running the image through metapicz it seems to work with all metadata intact.

Perhaps this is a superfluous check?

Example image

File format is not supported [png special case]

This one is a bit tricky. The sample image is detected by most applications as a valid PNG file. However, metadata-extractor does not recognize it as PNG.

Original:

Modified version with metadata:

Certain classes should override toString()

The toString() method from Object should be overridden where it seems appropriate. One example of a class missing a toString() method is Metadata.

Excerpt from the toString() javadoc: "The result should be a concise but informative representation that is easy for a person to read."

Review raised exceptions

Currently ImageProcessingException covers a wide range of exceptional circumstances. Review these and determine whether subclassing this exception would make sense.

For example UnsupportedImageFormatException.

(Adapted from Google Code issue 91)

Possible OutOfMemoryError when reading certain large TIFF files

TIFF files can store huge data buffers in tags which TiffReader happily loads into memory.

For example:

[Exif IFD0] Unknown tag (0x935c) = [443064764 bytes]

This can cause an error such as:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  at com.drew.lang.RandomAccessFileReader.getBytes(Unknown Source)
  at com.drew.metadata.exif.ExifReader.processTag(Unknown Source)
  at com.drew.metadata.exif.ExifReader.processDirectory(Unknown Source)
  at com.drew.metadata.exif.ExifReader.extractIFD(Unknown Source)
  at com.drew.metadata.exif.ExifReader.extractTiff(Unknown Source)
  at com.drew.imaging.tiff.TiffMetadataReader.readMetadata(Unknown Source)
  at com.drew.imaging.ImageMetadataReader.readMetadata(Unknown Source)
  at com.drew.imaging.ImageMetadataReader.readMetadata(Unknown Source)

It'd be sensible to allow specifying a maximum tag size, or perhaps a list of tags to ignore, or maybe something else.

(Migrated from a Google Code issue)
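One way the "maximum tag size" idea could look is sketched below. TagSizeGuard and its limit are hypothetical; the issue does not propose a specific value or API. The point is only that the size check happens before any buffer is allocated.

```java
// Hypothetical guard consulted before a tag's bytes are loaded into memory.
final class TagSizeGuard {
    private final long maxTagBytes;

    TagSizeGuard(long maxTagBytes) {
        if (maxTagBytes < 0) {
            throw new IllegalArgumentException("maxTagBytes must not be negative");
        }
        this.maxTagBytes = maxTagBytes;
    }

    // Returns true when a tag of componentCount components, each of
    // bytesPerComponent bytes, is small enough to load. The product is
    // computed in long so a 32-bit count cannot overflow.
    boolean shouldLoad(long componentCount, int bytesPerComponent) {
        if (componentCount < 0 || bytesPerComponent <= 0) {
            return false;
        }
        long byteCount = componentCount * bytesPerComponent;
        return byteCount <= maxTagBytes;
    }
}
```

With a 16 MiB limit, the 443064764-byte tag from the example above would be skipped while ordinary tags are still read.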

IPTC character encoding

IPTC character encoding is assumed to match the system default (via system property file.encoding), which is incorrect.

It may be possible to use the IPTC CodedCharacterSet tag to determine the encoding. Otherwise the user should be able to specify an encoding at read time.

There is quite some discussion about this problem at Google Code issue 38, from which this was migrated.
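A minimal sketch of the proposed approach follows. IptcCharsets is a hypothetical helper, and it recognises only one CodedCharacterSet value: the ISO 2022 escape sequence ESC % G, which designates UTF-8. Any other value falls back to a charset supplied by the caller instead of the system default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class IptcCharsets {
    // ISO 2022 escape sequence ESC % G, which designates UTF-8.
    private static final byte[] UTF8_ESCAPE = {0x1B, 0x25, 0x47};

    // Chooses a charset from the raw CodedCharacterSet bytes, or returns
    // the caller's fallback when the tag is absent or not recognised.
    static Charset resolve(byte[] codedCharacterSet, Charset fallback) {
        if (codedCharacterSet != null && Arrays.equals(codedCharacterSet, UTF8_ESCAPE)) {
            return StandardCharsets.UTF_8;
        }
        return fallback;
    }

    // Decodes an IPTC string value with an explicit charset rather than
    // relying on the file.encoding system property.
    static String decode(byte[] value, byte[] codedCharacterSet, Charset fallback) {
        return new String(value, resolve(codedCharacterSet, fallback));
    }
}
```

A fuller version would need to handle the other escape sequences that can appear in this tag; this sketch deliberately does not guess at them.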

Malformed Javadoc comments

The JDK 8 Javadoc tool has a new feature called DocLint. DocLint can check for malformed Javadoc comments and is enabled by default. We should use DocLint to correct the existing JavaDocs.

Review clones made on Google Code

During the project's time on Google Code, 31 clones were made, of which it seems that five had changes pushed.

Review these for features to merge.

https://code.google.com/p/metadata-extractor/source/clones

amandra - XMP/Android related changes
~~nkeating - cross locale unit tests (obsolete)~~
~~sampisa - IPTC in TIFF~~
~~raygauss-master / raygauss-2.6.2-maven - Maven/Sonatype changes~~
~~farrukhnajm - Maven changes~~

Support Olympus ORF camera RAW format

Olympus cameras such as the Pen E-PL1 produce ORF files. metadata-extractor parses the TIFF successfully, but the tags are unknown or even incorrectly presented.

Note that Exiftool can process these files.

(Migrated from Google Code issue 43)

Pentax K5 makernote incorrectly detected as Casio type 2

No sample image exists in the directory, however they may be found online.

(Migrated from Google Code issue 86)

Roadmap for 2.8 – Suggestions are welcome!

If you as a user want something to be done in the next release, feel free to post below.

@drewnoakes
If I recall correctly, you wanted to release 2.8 in January. I would prefer not to release before 11.1. As I already mentioned, I think that for 2.8 we should focus on supporting more formats and on improving support for those already supported, e.g. GIF and PNG.

Support Canon CRW camera RAW format

The older CRW file format (superseded by CR2) is treated as a TIFF file, however it does not meet the library's expectations of TIFF files.

When TiffMetadataReader attempts to verify them, it falls over with ExifReader expecting 49492a but getting 49491a due to the difference in file format.

CRW data is stored as CIFF (Camera Image File Format) which is similar to TIFF, but differs.

Specifications at:

http://www.sno.phy.queensu.ca/~phil/exiftool/canon_raw.html

(Migrated from Google Code issue 42)
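The header difference described above can be detected before TIFF processing starts. The sketch below uses a hypothetical RawHeaderSniffer and relies only on the bytes quoted in this issue (49 49 2a for little-endian TIFF, 49 49 1a for CRW) plus the standard big-endian TIFF header. It is a coarse check, not a full CIFF validator.

```java
// Hypothetical helper that classifies a file by its first four bytes.
final class RawHeaderSniffer {
    enum Kind { TIFF, CIFF, UNKNOWN }

    static Kind classify(byte[] header) {
        if (header == null || header.length < 4) {
            return Kind.UNKNOWN;
        }
        boolean littleEndian = header[0] == 'I' && header[1] == 'I';
        boolean bigEndian = header[0] == 'M' && header[1] == 'M';

        // TIFF: byte-order mark followed by the number 42 in that byte order.
        if (littleEndian && header[2] == 0x2A && header[3] == 0x00) {
            return Kind.TIFF;
        }
        if (bigEndian && header[2] == 0x00 && header[3] == 0x2A) {
            return Kind.TIFF;
        }
        // CRW as reported in this issue: "II" followed by 0x1A instead of 0x2A.
        if (littleEndian && header[2] == 0x1A) {
            return Kind.CIFF;
        }
        return Kind.UNKNOWN;
    }
}
```

Routing CIFF files away from the TIFF reader would at least replace the confusing marker mismatch with a clear "unsupported format" result until CRW support exists.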

Be more permissive when encountering invalid TIFF format codes

When an invalid TIFF format code is observed, earlier versions of the library would attempt to continue processing TIFF data. However, sometimes this would continue to interpret random bytes as meaningful data, producing random and misleading output. So in version 2.6.4, an invalid TIFF format code was considered a significant enough indication that processing should halt.

At the time this didn't adversely affect any of the images in the database.

However, one was found and reported in Google Code issue 94.

exiftool is able to process this file successfully.

The task here is to determine whether there is a safe way to allow processing to continue in the face of such an error. In general, sticking to the spec is useful and defensible, but in practice it can inconvenience some users.

Investigate a Travis-CI build

Once configured and passing, add the build status image to the README:

https://api.travis-ci.org/drewnoakes/metadata-extractor.svg

Builds available at: https://travis-ci.org/drewnoakes/metadata-extractor

Review StreamReader.skip logic

It may not be possible to provide a robust implementation of StreamReader.trySkip that works for all underlying InputStream types.

http://stackoverflow.com/a/14400985/24874

(Migrated from a Google Code issue)
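The difficulty is that InputStream.skip may skip fewer bytes than requested, including zero, without the stream being at its end. A common workaround, sketched here with a hypothetical StreamSkipper class rather than the library's StreamReader, is to loop and fall back to reading a single byte whenever skip makes no progress.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
final class StreamSkipper {
    // Attempts to skip exactly n bytes. Returns false if the stream ends
    // before n bytes have been consumed.
    static boolean trySkip(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() made no progress; a one-byte read distinguishes
            // "temporarily unable to skip" from "end of stream".
            if (in.read() == -1) {
                return false;
            }
            remaining--;
        }
        return true;
    }
}
```

This still depends on read() behaving correctly for the underlying stream, which is why the issue doubts that a fully robust version exists for every InputStream type.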

Support Sony ARW camera RAW format

Sony cameras such as the Nex7 produce ARW files. metadata-extractor parses the TIFF successfully, but the tags are unknown or even incorrectly presented.

Note that Exiftool can process these files.

(Migrated from Google Code issue 35)

Collect list of relevant resources on the wiki

Including, but not limited to:

Metadata formats
File formats
Makernotes

As:

Links to external sites
Scraped data
PDF files

MetadataReader Interface rework

The MetadataReader is only used by PsdReader and not by any of the Classes which actually contain MetadataReader in their name. Thus the MetadataReader interface requires a rework.

Observed multiple instances of PNG chunk 'mkBT', for which multiples are not allowed

It seems that multiple instances of the chunk 'mkBT' in PNG files are not that uncommon. I've seen this particular message for many PNG files. I also saw it for other chunks, but those were very rare compared to 'mkBT'.

We should investigate if multiple instances of 'mkBT' always indicates an error. Also we would need a way to deal with them. ExifTool can read the metadata without any error.

Sample:

Upgrade 2.6 - 2.7: Image throwing "cannot set a null String"

After upgrading from 2.6 to 2.7, I have had a few images throwing this error. I am trying to figure out what it might be, but reverting to 2.6 seems to work fine.

Sample image

ProcessAllImagesInFolderUtility should create a metadata directory if it is missing

Currently if the metadata directory does not exist in a given directory it will not write the file because it can't find the path.
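A possible fix is to create the directory just before writing. The sketch below uses a hypothetical MetadataFileWriter and an assumed output file name; only the directory name "metadata" comes from this issue.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class MetadataFileWriter {
    // Writes content to <imageDir>/metadata/<fileName>, creating the
    // "metadata" directory first if it does not exist yet.
    static Path write(Path imageDir, String fileName, String content) throws IOException {
        Path metadataDir = imageDir.resolve("metadata");
        // Does nothing when the directory is already present.
        Files.createDirectories(metadataDir);
        Path target = metadataDir.resolve(fileName);
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

Files.createDirectories requires Java 7; on older targets, File.mkdirs() plays the same role.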

File format is not supported [some gif files]

Some gifs are not processed correctly.

Example image:

ExifTool reads the file correctly. If needed I can provide more gif images which are not supported at the moment.

Improve Makernote support

There are a lot of different makernote tags out there. Many are documented online:

Hand-coding classes for all of these formats may not be the best approach. Some analysis could be done to see whether this could be data-driven (say from an XML file, for example), either at runtime or design time via codegen.

Adding support for these makernotes is quite easy, and a great place to get started if you want to contribute to this library.

(Migrated from Google Code issue 8)

Support Panasonic RW2 camera RAW format

Panasonic cameras such as the Lumix DMC-GF1 and GF3 produce RW2 files. metadata-extractor parses the TIFF successfully, but the tags are unknown or even incorrectly presented.

Note that Exiftool can process these files.

(Migrated from Google Code issue 35)

Produce values derived from one or more tags

There are many cases where answering a question about an image may involve reading multiple different tags, possibly from different directories.

Dealing with redundancy

Examples:

image width (equally height) may be obtained from the JpegDirectory and ExifIFD0Directory
There are often multiple ways to obtain exposure time
XMP duplicates a lot of existing tags

Devise a strategy that sits on top of the directories and tags for extracting certain commonly used values according to well tested heuristics. One challenge here is that tags may not agree and it may be unclear which to trust.

(Migrated from Google Code issue 26)
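The simplest form of such a strategy is a fixed priority order: try each source in turn and take the first value that is present. The sketch below is purely illustrative; DerivedValues and firstAvailable are hypothetical names, and each source would in practice be a lookup into one directory.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only.
final class DerivedValues {
    // Returns the first non-empty value, trying sources in the order given.
    // Sources after the first hit are not evaluated.
    @SafeVarargs
    static <T> Optional<T> firstAvailable(Supplier<Optional<T>>... sources) {
        for (Supplier<Optional<T>> source : sources) {
            Optional<T> value = source.get();
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }
}
```

This handles missing tags but not disagreement between tags; choosing which source to trust when several are present is the heuristic part that would need testing against real images.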

Grouping values

Sometimes multiple tags should be combined to produce one logical 'value':

GPS lat / lng
Date & time values (i.e. in IPTC data)
Aspect ratio (#494)

JavaDoc cross referencing

Convert <code>Foo</code> style comments to {@link Foo} style, where sensible.

Specify source code encoding in ant build

Building in some environments can fail when trying to map UTF-8 characters to ASCII:

[javac] /path/Source/com/drew/lang/GeoLocation.java:81: error: unmappable character for encoding ASCII
[javac]         return dms[0] + "?? " + dms[1] + "' " + dms[2] + '"';
[javac]                          ^

@rosset.filipe suggests adding the following to all javac tasks of build.xml:

encoding="UTF-8"

(Migrated from Google Code issue 90)

Release date for 2.7

I would suggest the following release cycle:
1st week of December 2.7 release candidate
2nd week of December 2.7 official release

Open issues which should be resolved before the release: 1, 3-9, 12, 36, 38

What do you think?

Support writing metadata

Currently metadata-extractor provides a read-only view onto the metadata within files.

Several use cases would benefit from or require the ability to write data back to files, such as comments, GPS location, image orientation, image size...

The implementation of this feature is non-trivial. Not all types of metadata can or should be modified, and of course there is a high cost associated with bugs that occur when people are overwriting their files, should images be lost.

Given that the library supports many types of metadata, it's more realistic to roll out support for writing different types of metadata incrementally. The first type to be attempted should probably be Exif and the first container type would likely be JPEG.

(Migrated from Google Code issue 66)

Review changes from clone amandra-xmp and merge it

https://code.google.com/r/amandra-xmp/source/browse

Additional tags for ExifSubIFDDirectory

New tags for ExifSubIFDDirectory:

public static final int TAG_RELATED_IMAGE_FILE_FORMAT = 0x1000; 
public static final int TAG_RELATED_IMAGE_WIDTH = 0x1001;
public static final int TAG_RELATED_IMAGE_LENGTH = 0x1002;
public static final int TAG_TRANSFER_RANGE = 0x0156 ;
public static final int TAG_JPEG_PROC = 0x0200;
public static final int TAG_MAKER_NOTE = 0x927C;
public static final int TAG_INTEROPERABILITY_OFFSET = 0xA005;

_tagNameMap.put(TAG_RELATED_IMAGE_FILE_FORMAT, "Related Image File Format"); 
_tagNameMap.put(TAG_RELATED_IMAGE_WIDTH, "Related Image Width");
_tagNameMap.put(TAG_RELATED_IMAGE_LENGTH, "Related Image Length");
_tagNameMap.put(TAG_TRANSFER_RANGE, "Transfer Range");
_tagNameMap.put(TAG_JPEG_PROC, "JPEG Proc");
_tagNameMap.put(TAG_COMPRESSED_AVERAGE_BITS_PER_PIXEL, "Compressed Bits Per Pixel");
_tagNameMap.put(TAG_MAKER_NOTE, "Maker Note");
_tagNameMap.put(TAG_INTEROPERABILITY_OFFSET, "Interoperability Offset");

(Migrated from a Google Code issue)

Add missing @Override annotations

Already done in fe4e0f8

Review shared tags between various Exif directories

The ExifIFD0Directory, ExifSubIFDDirectory and ExifIFD1Directory classes share some common tags. These were once merged into a single directory, however there would be conflicts between values from, for example, the image and its thumbnail.

These directories have now been split as described above, but I'm not convinced the code here is quite right. The Exif spec needs a thorough read and the code a review.

(migrated from a Google Code issue)

GPS location fails when component has zero value

When a latitude or longitude has a zero component in Exif data (that is, modelled as a rational with a zero-valued numerator), the rational reports a NaN value which then causes the value to be reported as NaN or null.

(Migrated from Google Code issue 84)
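The expected behaviour can be sketched as follows, assuming that a zero numerator should simply contribute 0.0 and that only a zero denominator makes a component invalid. GpsCoordinates is a hypothetical helper, and each rational is modelled as a two-element array of numerator and denominator rather than the library's Rational class.

```java
// Hypothetical helper; each rational is {numerator, denominator}.
final class GpsCoordinates {
    // Converts degrees, minutes and seconds to decimal degrees.
    // Returns null only when a component has a zero denominator.
    static Double toDecimalDegrees(long[] degrees, long[] minutes, long[] seconds,
                                   boolean negative) {
        if (degrees[1] == 0 || minutes[1] == 0 || seconds[1] == 0) {
            return null;
        }
        double value = (double) degrees[0] / degrees[1]
                     + (double) minutes[0] / minutes[1] / 60.0
                     + (double) seconds[0] / seconds[1] / 3600.0;
        return negative ? -value : value;
    }
}
```

Here negative stands for a southern latitude or western longitude reference.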

Additional tags for ExifIFD0Directory

New tags for ExifIFD0Directory:

public static final int TAG_NEW_SUBFILE_TYPE = 0x00fe; 
public static final int TAG_IMAGE_WIDTH = 0x0100;
public static final int TAG_IMAGE_HEIGHT = 0x0101; 
public static final int TAG_BITS_PER_SAMPLE = 0x0102; 
public static final int TAG_COMPRESSION = 0x0103; 
public static final int TAG_PHOTOMETRIC_INTERPRETATION = 0x0106; 

public static final int TAG_SAMPLES_PER_PIXEL = 0x0115;
public static final int TAG_ROWS_PER_STRIP = 0x0116;
public static final int TAG_STRIP_BYTE_COUNTS = 0x0117;
public static final int TAG_STRIP_OFFSETS = 0x0111;

public static final int TAG_PLANAR_CONFIGURATION = 0x011C;
public static final int TAG_SUB_IFDS = 0x014A; // SubIFDs is TIFF tag 330; 0x011C collided with PlanarConfiguration

public static final int TAG_DATE_TIME_ORIGINAL = 0x9003;
public static final int TAG_TIFF_EP_STANDARD_ID = 0x9216;

_tagNameMap.put(TAG_NEW_SUBFILE_TYPE, "New Subfile Type");
_tagNameMap.put(TAG_IMAGE_WIDTH, "Image Width");
_tagNameMap.put(TAG_IMAGE_HEIGHT, "Image Height");
_tagNameMap.put(TAG_BITS_PER_SAMPLE, "Bits Per Sample");
_tagNameMap.put(TAG_COMPRESSION, "Compression");
_tagNameMap.put(TAG_PHOTOMETRIC_INTERPRETATION, "Photometric Interpretation");

_tagNameMap.put(TAG_SAMPLES_PER_PIXEL, "Samples Per Pixel");
_tagNameMap.put(TAG_ROWS_PER_STRIP, "Rows Per Strip");
_tagNameMap.put(TAG_STRIP_BYTE_COUNTS, "Strip Byte Counts");
_tagNameMap.put(TAG_STRIP_OFFSETS, "Strip Offsets");

_tagNameMap.put(TAG_PLANAR_CONFIGURATION, "Planar Configuration");
_tagNameMap.put(TAG_SUB_IFDS, "Sub IFDs");

_tagNameMap.put(TAG_DATE_TIME_ORIGINAL, "Date Time Original");
_tagNameMap.put(TAG_TIFF_EP_STANDARD_ID, "Tiff EP Standard ID");

(Migrated from a Google Code issue)

Investigate test coverage reports from coveralls.io

https://coveralls.io
https://coveralls.io/r/drewnoakes/metadata-extractor

Needs a .coveralls.yml file with appropriate settings.

Buffer over/underflows in PhotoshopReader

[Photoshop] Number of requested bytes cannot be negative
[Photoshop] Attempt to read from beyond end of underlying data source

These two errors are really common and should not happen that often. This issue may be related to other existing issues.

Include version information in manifest

The library should include vendor, implementation title and implementation version metadata for the package. This is achieved via the META-INF/MANIFEST.MF file within the JAR file.

Such data may then be obtained via code such as:

Package p = com.drew.imaging.ImageMetadataReader.class.getPackage();

String title = p.getImplementationTitle();
String vendor = p.getImplementationVendor();
String version = p.getImplementationVersion();

Manifest entries:

Implementation-Title: metadata-extractor
Implementation-Version: 2.6.4
Implementation-Vendor: drewnoakes.com

Review whether it's feasible to unit test that these values are present. Builds won't always be run from JAR files however.

(Migrated from Google Code issue 78)

Support .editorconfig

http://editorconfig.org

Unit tests should pass in all locales

Some unit tests involve culture-sensitive formatting of values and as such can fail in different cultures/locales.

NikonType2MakernoteTest1#testGetAutoFlashCompensationDescription fails with 0,67 EV instead of 0.67 EV (in Germany).
PngMetadataReaderTest#testGimpGreyscaleWithManyChunks fails with Mon Dec 31 23:08:30 EST 2012 instead of Tue Jan 01 04:08:30 GMT 2013 (in EST).

This may also be the case for other tests.

Preference is to force the culture during unit testing rather than modifying the code under test to always use the en-GB culture. Users should get a format that's suited to their culture.

http://stackoverflow.com/questions/8190124/junit-testing-double-tostring-in-multiple-cultures

(Migrated from Google Code issues 29 and 92)
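Forcing the culture during a test can be done by swapping the JVM defaults and restoring them afterwards. The sketch below uses a hypothetical FixedCulture helper; in a JUnit suite the same set and restore steps would normally live in the setup and teardown methods.

```java
import java.util.Locale;
import java.util.TimeZone;
import java.util.function.Supplier;

// Hypothetical test helper, for illustration only.
final class FixedCulture {
    // Runs body with the given default locale and time zone, then restores
    // the previous defaults even if body throws.
    static <T> T call(Locale locale, TimeZone zone, Supplier<T> body) {
        Locale previousLocale = Locale.getDefault();
        TimeZone previousZone = TimeZone.getDefault();
        Locale.setDefault(locale);
        TimeZone.setDefault(zone);
        try {
            return body.get();
        } finally {
            Locale.setDefault(previousLocale);
            TimeZone.setDefault(previousZone);
        }
    }
}
```

For example, FixedCulture.call(Locale.UK, TimeZone.getTimeZone("GMT"), () -> String.format("%.2f EV", 0.67)) formats with a decimal point regardless of the machine's own locale. Because these defaults are global to the JVM, this approach is not safe for tests that run in parallel.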

Should we migrate the targeted Java version from 1.5 to 1.6?

According to the changelog, the Java version should be 1.6 instead of 1.5.
