knjcode / imgdupes
Identifying and removing near-duplicate images using perceptual hashing.
$ pip install imgdupes
Collecting imgdupes
Downloading imgdupes-0.1.1.tar.gz (11 kB)
Collecting future
Using cached future-0.18.2.tar.gz (829 kB)
Requirement already satisfied: ImageHash in c:\users\realh\appdata\local\programs\python\python38\lib\site-packages (from imgdupes) (3.4)
Requirement already satisfied: joblib in c:\users\realh\appdata\local\programs\python\python38\lib\site-packages (from imgdupes) (0.16.0)
ERROR: Could not find a version that satisfies the requirement ngt (from imgdupes) (from versions: none)
ERROR: No matching distribution found for ngt (from imgdupes)
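The c:\users path above suggests a Windows environment; as far as I know, the ngt binding does not publish Windows wheels, so pip has nothing to resolve there. One possible workaround, assuming Docker is available, is the prebuilt image used later in this thread (adjust the volume path for Windows):

docker run -it -v $PWD:/app knjcode/imgdupes . phash 0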
I have a huge library of files recovered from a broken hard disk with several different tools. I managed to salvage most of my photo library, but I discovered that many files have corrupt copies: they are identical to the intact versions for the first few rows, and then they are either cut off or turn into huge glitches.
The thing is that most photo-deduplication tools use perceptual image hashing, and while that works wonderfully for comparing "edited" images, it doesn't work at all for comparing corrupted files.
I don't know if this is feasible, but is there any chance of adding some sort of "pixel stream" comparison: instead of comparing perceptual hashes, treat the pixels as a string of color values read from the top-left corner and look for where they diverge.
Thanks in advance :)
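For what it's worth, a minimal sketch of that kind of raw-pixel prefix comparison, assuming Pillow is installed (the function name and the one-megabyte prefix length are illustrative, not part of imgdupes):

from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True  # let Pillow decode cut-off files

def pixel_prefix_similarity(path_a, path_b, prefix_bytes=1 << 20):
    # Decode both images to RGB and compare the raw byte streams,
    # read row by row starting from the top-left corner.
    a = Image.open(path_a).convert("RGB").tobytes()[:prefix_bytes]
    b = Image.open(path_b).convert("RGB").tobytes()[:prefix_bytes]
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

A corrupted copy would match its intact original almost perfectly at the start and then diverge, which a perceptual hash cannot express.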
The HEIF format keeps growing in use.
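If imgdupes hashes through Pillow (ImageHash does), a possible stopgap while waiting for native support is the pillow-heif package, which registers a HEIC/HEIF decoder with Pillow; a hedged sketch with a hypothetical file name:

from PIL import Image
import imagehash
from pillow_heif import register_heif_opener

register_heif_opener()  # adds HEIF/HEIC support to Image.open
print(imagehash.phash(Image.open("photo.heic")))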
Would be nice to be able to quickly check in a script or CI if dupes are found by checking the exit code
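Until a dedicated exit code exists, one workaround is to treat any stdout as "duplicates found", since imgdupes prints the matching file names (as the logs later in this thread show). A sketch, with the directory name and threshold as placeholders:

import subprocess

# imgdupes currently exits 0 either way, so inspect stdout instead.
result = subprocess.run(
    ["imgdupes", "-r", "images", "phash", "0"],
    capture_output=True, text=True, check=True,
)
if result.stdout.strip():
    raise SystemExit("duplicates found:\n" + result.stdout)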
I can zoom in on my terminal, but then when the next set of images comes they're resized back to the dimensions they were before, with the text remaining large.
I keep getting this error:
Error: Unable to load NGT. Please install NGT and python binding first.
even though I have installed NGT multiple times.
Not sure what the problem is; any help is much appreciated.
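One thing worth checking (a guess, not a confirmed fix): whether the binding imports in the same interpreter that runs imgdupes, since installing the ngt package into a different environment produces exactly this message:

import ngtpy  # the module installed by `pip install ngt`
print(ngtpy.__file__)  # confirms which environment it came from

The binding may also need the NGT shared library on the loader path, so building NGT from source without refreshing the linker cache is another possible cause.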
On my Synology NAS (DS720+) I installed Docker and tried to run imgdupes on a folder containing two identical images. It only works with --faiss-flat; I never get any output or result with --ngt or --hnsw, no matter what other options, values, or images I provide.
admin@nas2:/volume1/docker$ ll
total 600
-rwxrwxrwx 1 admin users 304184 Apr 1 2013 test1.jpg
-rwxrwxrwx 1 admin users 304184 Apr 1 2013 test2.jpg
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes . phash 0
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes --ngt . phash 0
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes --hnsw . phash 0
admin@nas2:/volume1/docker$ sudo docker run -it -v $PWD:/app knjcode/imgdupes --faiss-flat . phash 0
Building faiss index (dimension=64, num_proc=3)
Exact neighbor searching using faiss
100%|████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2545.09it/s]
test1.jpg
test2.jpg
admin@nas2:/volume1/docker$
What could cause this issue? Am I doing something wrong? Is it a bug?
If I cannot fix it, what are the pros and cons of --faiss-flat compared to --ngt or --hnsw?
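For context (my understanding, not from the imgdupes docs): --faiss-flat performs an exact brute-force search, while --ngt and --hnsw build approximate nearest-neighbor indexes that trade exactness for speed on large collections, so for a handful of images exact search is the safer choice. The exact comparison is easy to sanity-check by hand with the ImageHash library:

from PIL import Image
import imagehash

# Hamming distance between two 64-bit phashes; 0 means identical hashes,
# which is what the flat (brute-force) search computes pairwise.
print(imagehash.phash(Image.open("test1.jpg")) - imagehash.phash(Image.open("test2.jpg")))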
Running on Python 3.9, I get the following error:
$ imgdupes --recursive datasets phash 4
Building NGT index (dimension=64, num_proc=15)
Traceback (most recent call last):
File "/home/huzhuolei/miniconda3/envs/imgdupes/bin/imgdupes", line 230, in
main()
File "/home/huzhuolei/miniconda3/envs/imgdupes/bin/imgdupes", line 226, in main
dedupe_images(args)
File "/home/huzhuolei/miniconda3/envs/imgdupes/bin/imgdupes", line 94, in dedupe_images
deduper.dedupe(args)
File "/home/huzhuolei/miniconda3/envs/imgdupes/lib/python3.9/site-packages/common/imagededuper.py", line 172, in dedupe
ngt_index.batch_insert(self.hashcache.hshs(), num_proc)
RuntimeError: src/ngtpy.cpp:
I installed imgdupes with pip for both Python 2 and Python 3, and I get the same error when I run:
imgdupes --recursive target_dir phash 4
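If it helps isolate the crash, here is a minimal standalone test of the same batch_insert call the traceback dies in (the index path and dummy vectors are made up; this is a sketch, not imgdupes code):

import ngtpy

ngtpy.create(b"/tmp/ngt-smoke-test", dimension=64)  # same dimension imgdupes uses
index = ngtpy.Index(b"/tmp/ngt-smoke-test")
index.batch_insert([[0.0] * 64, [1.0] * 64], num_threads=2)
print(index.search([0.0] * 64, size=2))

If this also raises, the problem is in the ngt binding rather than in imgdupes.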
Hello.
I would love to run this on my Synology NAS (DS718+). Unfortunately, my containers always exit with code 132:
https://medium.com/@nprch_12/docker-exited-132-e38f9dd2cd0d
This is weird as my CPU should support SSE4.2.
cpuinfo.txt
Would you mind helping me in debugging this issue? Unfortunately I don't get any logs from the container as it crashes immediately.
Best
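For reference (general knowledge, not verified on this NAS): exit code 132 is 128 + SIGILL, an illegal-instruction crash, which typically means the binaries in the image were built for CPU features the host lacks. Prebuilt faiss binaries often assume AVX2, which the DS718+'s Celeron (an Apollo Lake part, if I recall correctly) lacks even though it does support SSE4.2. A quick way to list the host's flags:

# List the SIMD flags the host CPU actually advertises (Linux only).
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()
print("sse4_2:", "sse4_2" in flags, "avx2:", "avx2" in flags)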
For example, delete images if they are 60% similar, but keep them if they are only 59% similar.
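In case it unblocks anyone: the positional threshold imgdupes already takes is a Hamming distance over a 64-bit hash, so a similarity percentage maps onto it directly (my own conversion, not a built-in feature):

def distance_for_similarity(similarity, hash_bits=64):
    # Largest Hamming distance whose similarity is still >= the cutoff.
    return int(hash_bits * (1.0 - similarity))

print(distance_for_similarity(0.60))  # 25 -> run `imgdupes . phash 25`
print(distance_for_similarity(0.59))  # 26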
Given a directory layout like this:
directory1/
├── 1.png
└── directory2
    └── 1.copy.png
If I cd into directory1/ and then run:
imgdupes -r . --query 1.png dhash 4
it shows the expected result:
Searching similar images
100%|████████████| 2/2 [00:00<00:00, 15335.66it/s]
Query: 1.png
1.png
directory2/1.copy.png
However, if I run:
imgdupes -r directory2 --query 1.png dhash 4
it fails to find 1.copy.png:
Searching similar images
100%|████████████| 1/1 [00:00<00:00, 3342.07it/s]
Query: 1.png
I would like to do something similar to what's described in the docs, but instead of deleting duplicate files, I would like to search for duplicates (from a set of query images), find the duplicate with the highest quality, and copy that duplicate to a new folder.
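A rough sketch of that workflow using the ImageHash library directly (the directory names and the pixel-count "quality" heuristic are placeholders; imgdupes itself has no copy mode as far as I know):

import shutil
from pathlib import Path
from PIL import Image
import imagehash

# Group images by exact perceptual hash (Hamming distance 0).
groups = {}
for path in Path("library").rglob("*.jpg"):
    groups.setdefault(str(imagehash.phash(Image.open(path))), []).append(path)

dest = Path("best_of_dupes")
dest.mkdir(exist_ok=True)
for paths in groups.values():
    if len(paths) > 1:
        # Keep the member with the most pixels as the "highest quality".
        best = max(paths, key=lambda p: Image.open(p).size[0] * Image.open(p).size[1])
        shutil.copy2(best, dest / best.name)

For near-duplicates (distance > 0) you would bucket hashes by Hamming distance instead of exact string equality.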