imgdupes

imgdupes is a command-line tool for checking and deleting near-duplicate images in a target directory based on perceptual hashing.

Demo video: images from the Caltech 101 dataset, partially deduplicated for demonstration.

It is best to remove byte-identical duplicates with fdupes or jdupes in advance.
You can then check and delete near-duplicate images using imgdupes, which operates much like the fdupes command.
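Conceptually, a phash threshold like the one imgdupes takes on the command line boils down to a Hamming-distance test between two 64-bit hashes. A minimal standard-library sketch (not imgdupes' actual code, and the hash values below are invented):

```python
# Sketch: deciding whether two images are near-duplicates from their
# 64-bit perceptual hashes, given as hex strings.
def hamming_distance(h1: str, h2: str) -> int:
    """Count the bits that differ between two equal-length hex hashes."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def is_near_duplicate(h1: str, h2: str, threshold: int = 4) -> bool:
    """Mirror 'imgdupes target_dir phash 4': distance strictly below threshold."""
    return hamming_distance(h1, h2) < threshold

# Hash values are made up for illustration.
print(hamming_distance("ff00ff00ff00ff00", "ff00ff00ff00ff03"))   # 2
print(is_near_duplicate("ff00ff00ff00ff00", "ff00ff00ff00ff03"))  # True
```

The XOR isolates the differing bits, so counting ones in the result gives the distance directly.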

For large datasets

You can speed up deduplication with approximate nearest-neighbor search over Hamming distance using NGT or hnsw. See the Against large datasets section for details.

Install

To install, simply use pip:

$ pip install imgdupes

Usage

The following sample command finds sets of near-duplicate images in the target directory whose phash Hamming distance is less than 4.
To search for images recursively under the target directory, add the -r or --recursive option.

$ imgdupes --recursive target_dir phash 4
target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg

target_dir/watch_0122.jpg
target_dir/watch_0121.jpg
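The duplicate sets in this output can be thought of as clusters of files whose hashes fall within the threshold of one another. A hypothetical brute-force sketch of that grouping (imgdupes' own logic may differ, and the hashes are made up):

```python
# O(n^2) grouping of files into near-duplicate sets by Hamming distance.
def group_duplicates(hashes: dict, threshold: int = 4) -> list:
    """hashes maps filename -> integer perceptual hash; returns duplicate sets."""
    names = list(hashes)
    assigned = set()
    groups = []
    for i, a in enumerate(names):
        if a in assigned:
            continue
        group = [a]
        for b in names[i + 1:]:
            if b in assigned:
                continue
            if bin(hashes[a] ^ hashes[b]).count("1") < threshold:
                group.append(b)
                assigned.add(b)
        if len(group) > 1:       # singletons are not duplicates
            assigned.add(a)
            groups.append(group)
    return groups

# Made-up hashes: a.jpg and b.jpg differ by 1 bit; c.jpg is far from both.
demo = {"a.jpg": 0xFF00, "b.jpg": 0xFF01, "c.jpg": 0x00FF}
print(group_duplicates(demo))  # [['a.jpg', 'b.jpg']]
```

This quadratic scan is why the approximate indexes described later matter for large datasets.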

By default, imgdupes displays the list of duplicate images and exits.
To be prompted about which images to preserve or delete, use the -d or --delete option.

If you are using iTerm 2, you can display a set of images on the terminal with the -c or --imgcat option.

$ imgdupes --recursive --delete --imgcat 101_ObjectCategories phash 4

The images in each set are sorted by file size and displayed together with their pixel dimensions, so you can choose which image to preserve.

With the -N or --noprompt option, imgdupes preserves the first file in each set of duplicates and deletes the rest without prompting.

$ imgdupes -rdN 101_ObjectCategories phash 0

To take input from a list of files

Use the -T or --files-from option to take input from a list of files.

$ imgdupes -T image_list.txt phash 0

For example, create image_list.txt as below.

101_ObjectCategories/Faces/image_0345.jpg
101_ObjectCategories/Motorbikes/image_0269.jpg
101_ObjectCategories/Motorbikes/image_0735.jpg
101_ObjectCategories/brain/image_0047.jpg
101_ObjectCategories/headphone/image_0034.jpg
101_ObjectCategories/dollar_bill/image_0038.jpg
101_ObjectCategories/ferry/image_0020.jpg
101_ObjectCategories/tick/image_0049.jpg
101_ObjectCategories/Faces_easy/image_0283.jpg
101_ObjectCategories/watch/image_0171.jpg
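Reading such a list is straightforward; here is a rough standard-library equivalent of the -T input step (imgdupes' own parsing may handle edge cases differently):

```python
# Sketch: read a file-list text file, one path per line, skipping blanks.
from pathlib import Path

def read_file_list(list_path: str) -> list:
    """Return the non-empty, stripped lines of a file-list text file."""
    text = Path(list_path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]

# Example: write a small list (with a blank line) and read it back.
Path("image_list_demo.txt").write_text("a/img_0001.jpg\n\nb/img_0002.jpg\n")
print(read_file_list("image_list_demo.txt"))  # ['a/img_0001.jpg', 'b/img_0002.jpg']
```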

Finding near-duplicates of a specified image

Use the --query option to specify a query image file.

$ imgdupes --recursive target_dir --query target_dir/airplane_0583.jpg phash 4
Query: target_dir/airplane_0583.jpg

target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg
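A --query search conceptually reduces to filtering the indexed hashes by their distance to the query image's hash. A sketch with invented hash values (not imgdupes internals):

```python
# Sketch: return files whose hash is within the threshold of the query hash.
def search_near(query_hash: int, hashes: dict, threshold: int = 4) -> list:
    return [name for name, h in hashes.items()
            if bin(query_hash ^ h).count("1") < threshold]

# Made-up hashes: the two airplane images are 1 bit apart, the watch is far.
index = {"airplane_0583.jpg": 0xAB00, "airplane_0800.jpg": 0xAB01,
         "watch_0122.jpg": 0x1234}
print(search_near(0xAB00, index))  # ['airplane_0583.jpg', 'airplane_0800.jpg']
```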

Against large datasets

imgdupes supports approximate nearest-neighbor search over Hamming distance using NGT or hnsw.

To dedupe images using NGT, run with the --ngt option after installing NGT and its Python binding.

$ imgdupes -rdc --ngt 101_ObjectCategories phash 4

Note: the --ngt option has been enabled by default since version 0.1.0.

For instructions on installing NGT and its Python binding, see NGT and python NGT.

To dedupe images using hnsw, run with the --hnsw option after installing the hnsw Python binding.

$ imgdupes -rdc --hnsw 101_ObjectCategories phash 4
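Indexes like NGT and hnsw operate on fixed-length vectors, so a 64-bit hash is typically unpacked into a 64-dimensional 0/1 vector before insertion. A sketch of that conversion (an assumption about the representation, not imgdupes' exact code):

```python
# Sketch: unpack an integer hash into a 0/1 vector, most significant bit first,
# as a vector index over Hamming distance would consume it.
def hash_to_bits(h: int, bits: int = 64) -> list:
    return [(h >> i) & 1 for i in range(bits - 1, -1, -1)]

v = hash_to_bits(0x8000000000000001)  # only the top and bottom bits are set
print(v[0], v[-1], sum(v))  # 1 1 2
```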

Fast exact searching

imgdupes supports exact nearest-neighbor search over Hamming distance using faiss (IndexFlatL2).

To dedupe images using faiss, run with the --faiss-flat option after installing the faiss Python binding.

$ imgdupes -rdc --faiss-flat 101_ObjectCategories phash 4
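An L2 index can perform exact Hamming search because, on 0/1 vectors, the squared Euclidean distance equals the Hamming distance: (a-b)^2 is 1 exactly where the bits differ. A plain-Python check of that identity (no faiss required):

```python
# Demonstrate: squared L2 distance == Hamming distance on binary vectors.
def squared_l2(a: list, b: list) -> int:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hamming(a: list, b: list) -> int:
    return sum(x != y for x, y in zip(a, b))

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 1, 0, 0, 0, 1, 1]
print(squared_l2(a, b), hamming(a, b))  # 3 3
```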

Using imgdupes with docker, without installing it

You can use imgdupes without installing it by using a pre-built docker container image.
NGT, hnsw, and faiss are already installed in this image.

Place the target directory in the current directory and execute the following command.

$ docker run -it -v $PWD:/app knjcode/imgdupes -rdc target_dir phash 0

At docker run, the current directory is mounted inside the container and referenced by imgdupes.

By aliasing the command, you can use imgdupes as if it were installed.

$ alias imgdupes='docker run -it -v $PWD:/app knjcode/imgdupes'
$ imgdupes -rdc target_dir phash 0

(Single quotes defer the expansion of $PWD until the alias is run, so whichever directory you invoke it from is mounted.)

To upgrade the imgdupes docker image, pull it again as below.

$ docker pull knjcode/imgdupes

Available hash algorithms

imgdupes uses the ImageHash library to calculate perceptual hashes (except for the phash_org algorithm).

  • ahash: average hashing

  • phash: perceptual hashing (using only the 8x8 DCT low-frequency values, including the first term)

  • dhash: difference hashing

  • whash: wavelet hashing

  • phash_org: perceptual hashing (a fixed version of the ImageHash implementation)

    uses only the 8x8 DCT low-frequency values, excluding the first term, since the DC coefficient can differ significantly from the other values and throw off the average.
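Of these, average hashing is the easiest to sketch. Assuming the image has already been reduced to an 8x8 grayscale grid (the resize and grayscale steps are omitted, so this is a sketch rather than ImageHash's actual code), each bit records whether a pixel is brighter than the mean:

```python
# Sketch of ahash on an already-resized 8x8 grayscale grid (no image libraries).
def average_hash(pixels: list) -> int:
    """pixels: 8x8 nested list of 0-255 intensities -> 64-bit integer hash."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    h = 0
    for p in flat:                      # row-major, MSB first
        h = (h << 1) | (1 if p > mean else 0)
    return h

# Toy grid: left half dark (0), right half bright (255).
grid = [[0, 0, 0, 0, 255, 255, 255, 255] for _ in range(8)]
print(f"{average_hash(grid):016x}")  # 0f0f0f0f0f0f0f0f
```

phash and phash_org follow the same thresholding idea but apply it to DCT coefficients rather than raw pixels.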

Options

-r --recursive

search images recursively from the target directory (default=False)

-d --delete

prompt user for files to preserve and delete (default=False)

-c --imgcat

display duplicate images for iTerm2 (default=False)

-m --summarize

summarize dupe information

-N --noprompt

together with --delete, preserve the first file in each set of duplicates and delete the rest without prompting the user

--query <image filename>

find image files that are duplicated or similar to the specified image file from the target directory

--hash-bits 64

bits of perceptual hash (default=64)

The number of bits must be a square number n^2.
For example, you can specify 64 (8^2), 144 (12^2), 256 (16^2), etc.

--sort <sort_type>

how to sort duplicate image files (default=filesize)

You can specify the following types:

  • filesize: sort by filesize in descending order
  • filepath: sort by filepath in ascending order
  • imagesize: sort by pixel width and height in descending order
  • width: sort by pixel width in descending order
  • height: sort by pixel height in descending order
  • none: do not sort

--reverse

reverse sort order

--num-proc 4

number of processes for hash calculation and NGT (default=cpu_count-1)

--log

output logs of duplicated and deleted files (default=False)

--no-cache

do not create or use the image hash cache (default=False)

--no-subdir-warning

suppress the warnings shown when similar images are found in different subdirectories

--sameline

list each set of matches on a single line

--dry-run

dry run (do not delete any files)

--faiss-flat

use faiss exact search (IndexFlatL2) for calculating Hamming distances between image hashes (default=False)

--faiss-flat-k 20

number of searched objects when using faiss-flat (default=20)

imgcat options (used with -c, --imgcat)

--size 256x256

resize image (default=256x256)

--space 0

space between images (default=0)

--space-color black

space color between images (default=black)

--tile-num 4

number of horizontal tiles (default=4)

--interpolation INTER_LINEAR

interpolation methods (default=INTER_LINEAR)

You can specify OpenCV interpolation methods: INTER_NEAREST, INTER_LINEAR, INTER_AREA, INTER_CUBIC, INTER_LANCZOS4, etc.

--no-keep-aspect

do not keep the aspect ratio when displaying images

ngt options

--ngt

use NGT for calculating Hamming distances between image hashes (default=True)

--ngt-k 20

number of objects searched when using NGT; increasing this value improves accuracy but increases computation time (default=20)

--ngt-epsilon 0.1

search range when using NGT; increasing this value improves accuracy but increases computation time (default=0.1)

--ngt-edges 10

number of initial edges of each node at graph generation time. (default=10)

--ngt-edges-for-search 40

number of edges at search time. (default=40)

hnsw options

--hnsw

use hnsw for calculating Hamming distances between image hashes (default=False)

--hnsw-k 20

number of objects searched when using hnsw; increasing this value improves accuracy but increases computation time (default=20)

--hnsw-ef-construction 100

controls index search speed/build speed tradeoff (default=100)

--hnsw-m 16

M is tightly connected with the internal dimensionality of the data and strongly affects memory consumption (default=16)

--hnsw-ef 50

controls recall. higher ef leads to better accuracy, but slower search (default=50)

faiss options

--faiss-cuda

use a CUDA-enabled device for faster searching (requires faiss-gpu, an Nvidia GPU, and the CUDA toolkit)
Install: https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
General: https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU

CUDA options

--cuda-device

use the specified CUDA device for CUDA-accelerated searches (default=the device with the lowest load)
NOTE: if the specified device is not found on the system, the CUDA device with the lowest load will be used

License

MIT

imgdupes's People

Contributors: knjcode, sam-ulrich1

imgdupes's Issues

error on install

pip install imgdupes
Collecting imgdupes
  Downloading imgdupes-0.1.1.tar.gz (11 kB)
Collecting future
  Using cached future-0.18.2.tar.gz (829 kB)
Requirement already satisfied: ImageHash in c:\users\realh\appdata\local\programs\python\python38\lib\site-packages (from imgdupes) (3.4)
Requirement already satisfied: joblib in c:\users\realh\appdata\local\programs\python\python38\lib\site-packages (from imgdupes) (0.16.0)
ERROR: Could not find a version that satisfies the requirement ngt (from imgdupes) (from versions: none)
ERROR: No matching distribution found for ngt (from imgdupes)

Find highest-quality duplicate instead of removing duplicates

I would like to do something similar to what's described in the docs, but instead of deleting duplicate files, I would like to search for duplicates (from a set of query images), find the duplicate with the highest quality, and copy that duplicate to a new folder.

bash: imgdupes: command not found

I installed imgdupes with pip for both Python 2 and Python 3; I get the same error when I run:
imgdupes --recursive target_dir phash 4

Compare with damaged JPGs

I have a huge library of files recovered with various tools from a broken hard disk. I managed to salvage most of my photo library, but I discovered I have many corrupt copies of files: they are identical to my uncorrupted versions for the first part, and then they get cut off or break into huge glitches.

The thing is that most photo-duplicate removal tools use image hashing algorithms, and while these work wonderfully for comparing "edited" images, they don't work at all for comparing corrupted files.

I don't know whether it would be possible, but is there any chance of adding some sort of "pixel stream" comparison: looking at differences not via a perceptual hash, but by treating the pixels as a string of color values starting from the top-left corner?

Thanks in advance :)

RuntimeError: src/ngtpy.cpp

Running on Python 3.9, I get the following error:

$ imgdupes --recursive datasets phash 4
Building NGT index (dimension=64, num_proc=15)
Traceback (most recent call last):
  File "/home/huzhuolei/miniconda3/envs/imgdupes/bin/imgdupes", line 230, in <module>
    main()
  File "/home/huzhuolei/miniconda3/envs/imgdupes/bin/imgdupes", line 226, in main
    dedupe_images(args)
  File "/home/huzhuolei/miniconda3/envs/imgdupes/bin/imgdupes", line 94, in dedupe_images
    deduper.dedupe(args)
  File "/home/huzhuolei/miniconda3/envs/imgdupes/lib/python3.9/site-packages/common/imagededuper.py", line 172, in dedupe
    ngt_index.batch_insert(self.hashcache.hshs(), num_proc)
RuntimeError: src/ngtpy.cpp:

"--query" option does not work if the specified image is not contained in the target_dir

directory like this:

directory1/
├── 1.png
└── directory2/
    └── 1.copy.png

If I cd to directory1/ then run:
imgdupes -r . --query 1.png dhash 4

It would show the result:

Searching similar images
100%|████████████| 2/2 [00:00<00:00, 15335.66it/s]
Query: 1.png

1.png
directory2/1.copy.png

However, if I run:
imgdupes -r directory2 --query 1.png dhash 4

It fails to find 1.copy.png:

Searching similar images
100%|████████████| 1/1 [00:00<00:00, 3342.07it/s]
Query: 1.png

Only --faiss-flat works in docker on Synology NAS???

On my Synology NAS (DS720+) I installed Docker and tried to run imgdupes on a folder with two identical images. It works only with --faiss-flat; I never get any output or result with --ngt or --hnsw, no matter what other options, values, or images I provide.

admin@nas2:/volume1/docker$ ll
total 600
-rwxrwxrwx 1 admin users 304184 Apr  1  2013 test1.jpg
-rwxrwxrwx 1 admin users 304184 Apr  1  2013 test2.jpg
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes . phash 0
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes --ngt . phash 0
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes --hnsw . phash 0
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes --faiss-flat . phash 0
Building faiss index (dimension=64, num_proc=3)
Exact neighbor searching using faiss
100%|████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2545.09it/s]
test1.jpg
test2.jpg

admin@nas2:/volume1/docker$

What could cause this issue? Am I doing something wrong? Is it a bug?

If I cannot fix it, what are the pros and cons of --faiss-flat in contrast to --ngt or --hnsw?
