Coder Social home page Coder Social logo

Too many collisions? about imagehash HOT 28 OPEN

jenssegers avatar jenssegers commented on July 17, 2024
Too many collisions?

from imagehash.

Comments (28)

doughboyks avatar doughboyks commented on July 17, 2024 2

I think I have this figured out. I made the MySQL field BigINT Unsigned, I installed bcmath(so I don't get E+19 values) and I am storing and comparing against the hexdec value of the hash. I did a limited test and it got an exact match against one image (the others were in the 20's). I did another test where there were two similar images (one color and one black and white) and it matched perfect on the color and was like 10 on the B&W).

Looks like it will take 80+ hours before it will have hashed everything. I will do some more testing and see if it all is working and if so we can consider this "issue" closed (just might want to take the learnings and apply to the readme.

from imagehash.

Lucassifoni avatar Lucassifoni commented on July 17, 2024 1

@scratcher28 the finding duplicates part was good enough, we supplemented it with some heuristics as a duplicate flagging pipeline :

  • Does it have the same sha256 hash as another one ? Exit, use a reference to the previous one without telling the user.
  • Does this image have the same filename than another one ? Flag (weight by filename length)
  • Does this image have a hash (as given by this library) close to another one ? Flag
  • Is the user who uploads it the same as the user who uploaded some of those flagged images ? Flag

Depending on the number of flags, you could show a popup to the user to ask for confirmation and word it more or less strongly. If the user says "yes, they're the same", we keep the one that had a better resolution.

from imagehash.

jenssegers avatar jenssegers commented on July 17, 2024

That's weird. Can you make sure you are comparing integers, and not strings?

from imagehash.

doughboyks avatar doughboyks commented on July 17, 2024

I am not following. I have both of those hashes stored in the database just like that. I could not find out how to get and store the 01010101011 representation.

So the query actually would look like

SELECT c.*, BIT_COUNT('12e627593dbc2307' ^ '12d2552c66ddc94b') as hamming_distance
FROM images i
where hash is not null
ORDER BY hamming_distance ASC

from imagehash.

doughboyks avatar doughboyks commented on July 17, 2024

Following up on this. What am I doing wrong or does it not work for what I want it to?

from imagehash.

doughboyks avatar doughboyks commented on July 17, 2024

Any help on this would be greatly appreciated. Either I am not following how to get the binary value of the hash or what I am doing is correct and it has limitations.

from imagehash.

Opekunov avatar Opekunov commented on July 17, 2024

Sorry for my Eng.
First -
_MySQL BIT_COUNT() syntax:
BIT_COUNT(N)
Where N is an integer. (Not string)
Not sure, but it seems to me that the 01010101011 is also not a string

Why you can't loaded all md5 hashes from DB to script, and use
$yourVar->distance($hash1,$hash2)?
My test function (return 10 closest hashes) with 50000 hashes - result 0.8 sec (load from db -> foreach -> asort -> array_spice)

with your test pics:
$yourVar->distance('12d2552c66ddc94b','12e627593dbc2307') //return 28

from imagehash.

doughboyks avatar doughboyks commented on July 17, 2024

Because by the time I get done hashing all of the images there will be 7 times as many as yours. All things being equal that would be more than 5 seconds.

So I need the MySQL query to work, I am just seeing no documentation on how to store and ultimately compare the binary version of a hash. For instance is there a "$hash->get_binary"? I could then store that and pass it into the query.

from imagehash.

doughboyks avatar doughboyks commented on July 17, 2024

Let me ask this. Is the 0101011100 just an example and not the actual format? Following through the code for distance and when I call it passing the hash I get:

hash: 7e426e403d1e71f | hexdec: 568622213714011935

Should I be storing and using this hexdec value (even though it isn't 0's and 1's)?

from imagehash.

Opekunov avatar Opekunov commented on July 17, 2024

Ok, try with hexdec
$hash1 = hexdec ("12e627593dbc2307"); $hash2 = hexdec ("12d2552c66ddc94b");
SELECT *, BIT_COUNT(1361819201567466247 ^ hashBig) as hd from photoHash ORDER BY hd ASC;
Result 28, like $->distance($hash1,$hash2);
screenshot

But maybe need test with more images
One more faster solution https://stackoverflow.com/questions/4777070/hamming-distance-on-binary-strings-in-sql/4783415

from imagehash.

Opekunov avatar Opekunov commented on July 17, 2024

ALTER TABLE photoHash ADD hashBit BIT(YOURCOUNT)

for storing 01010101011 representation, but this not INT for BIT_COUNT

from imagehash.

doughboyks avatar doughboyks commented on July 17, 2024

From the readme it has:

Image 1 hash: 3c3e0e1a3a1e1e1e (0011110000111110000011100001101000111010000111100001111000011110)
Image 2 hash: 3c3e0e3e3e1e1e1e (0011110000111110000011100011111000111110000111100001111000011110)

How do I generate the information in the parenthesis? I have the hash, just need to know how to get it as all 0's and 1's.

from imagehash.

jdreesen avatar jdreesen commented on July 17, 2024

In PHP you could do decbin(hexdec($hash)) or base_convert($hash, 16, 2) to convert the hex value into a binary value.

from imagehash.

doughboyks avatar doughboyks commented on July 17, 2024

Thanks jdreesen. So now I am attempting to store that data, but what format? The only thing that appears to keep it like this is varchar. But when I do the query like this:

SELECT i.*, BIT_COUNT(101001010110000000110011011011010010101111101100011010100001101 ^ i.hash) as hamming_distance
FROM images ci
where hash is not null
ORDER BY hamming_distance ASC

After processing only 10 images and grabbing one of the hashes all 10 are being returned as 1. Obviously I am not storing it correctly or I am not comparing correctly, or something. I apologize that I am having this much difficulty with this, but other than getting a hash none of this is really defined.

from imagehash.

Lucassifoni avatar Lucassifoni commented on July 17, 2024

Hi @doughboyks, could you elaborate on your use of bcmath to solve this ? My hashes int representation get truncated to 9223372036854775807 aka PHP_INT_MAX when I try to store them.

I'm considering splitting the hash in two parts for now, like

8ececed9d8d8d0f4 => 8ececed9, d8d8d0f4
to decimal => 2395918041, 3638087924

And then querying potential duplicates this way, assuming hash_1 and hash_2 being BIGINT columns, and supplied parameters being [int image.hash_1, int image.hash_2, int treshold].
(note, MySQL dialect)

select ID, BIT_COUNT(hash_1 XOR ?) + BIT_COUNT(hash2 XOR ?) AS distance FROM pictures HAVING distance < ?

But this really feels dirty, I'd like to avoid this approach.
Still, it works for now as a POC for my project, and I'm getting the same distances with @jenssegers compare() function :

  • phash is the whole hex string,
  • hash_1 an int representation of the first 8 digits of this hex string,
  • hash_2 an int representation of the last 8 digits of the hex string.
    /* Picture.php, Laravel5.5 Eloquent Model */

    /**
     * @param int $level
     * Returns potential duplicates of this picture.
     */
    public function findPotentialDuplicates(int $level = 5) {
        $hash_1 = $this->hash_1;
        $hash_2 = $this->hash_2;
        return DB::table('pictures')
            ->selectRaw('id, BIT_COUNT(hash_1 ^ ?) + BIT_COUNT(hash_2 ^ ?)  AS distance', [$h1, $h2])
            ->havingRaw('distance < ?', [$level])
            ->get();
    }

   /**
     * @param Picture $a
     * @param Picture $b
     * @return int
     */
    public static function compare(Picture $a, Picture $b) {
        $hasher = self::getHasher();
        return $hasher->distance($a->phash, $b->phash);
    }

But hey, it works for now, I'm getting similar and "opposite" images with similarity cursors, couldn't get fancier ! Thanks for documenting your issue, it allowed me to resolve this.

from imagehash.

scratcher28 avatar scratcher28 commented on July 17, 2024

@Lucassifoni, any progress on this issue?

from imagehash.

Lucassifoni avatar Lucassifoni commented on July 17, 2024

@scratcher28, splitting the hash in two columns worked fine with our approx. 120.000 images, with relatively fast queries on a modest server (meaning : it wasn't annoying to use), but we moved on and this system isn't in use anymore.

Re-reading this after a year, my approach was naïve and I should have designed something better.

from imagehash.

scratcher28 avatar scratcher28 commented on July 17, 2024

@Lucassifoni, do you know/have any better alternatives to find duplicates?

from imagehash.

DynamoGeek avatar DynamoGeek commented on July 17, 2024

With the conversation on this having hit on performance, I wanted to add that I find performance for hash comparisons quite bearable if you have GMP installed. Personally, I'm going to avoid doing any fancy database shenanigans as it'd lock me into a particular storage mechanism.

For 50,000 hash comparisons:
With GMP enabled - ~0.23 seconds
Without GMP enabled - ~1.54 seconds

While the exact times will vary depending on server performance, hashes, the position of the moon, and any number of other things, I strongly suspect the 80% reduction in execution time will be pretty consistent.

from imagehash.

garygreen avatar garygreen commented on July 17, 2024

For optimizing mysql query checkout this stackoverflow answer. It suggests to narrow down search results by adding all the bits together and caching that value in a seperate column - that will allow you to exclude checking images that are obviously not alike.

from imagehash.

aacook avatar aacook commented on July 17, 2024

I agree it would be helpful for some of this to exist in the install instructions.

To solve for PHP displaying numbers in E notation, I ended up making this work by installing bcmath. On Ubuntu:
$ sudo apt update
$ sudo apt install php7.4-bcmath

To solve for numbers being rounded to 9223372036854775807 in the database, make sure your column is bigint unsigned.

bcmath causes issues with number display (carrying what should be integers out to a crazy number of decimals). You might also find these functions helpful, in particular bcround()

{
    return strpos($n, '-') === 0; // Is the number less than 0?
}

function bcceil($n)
{
    return bcnegative($n) ? (($v = bcfloor(substr($n, 1))) ? "-$v" : $v)
                          : bcadd(strtok($n, '.'), strtok('.') != 0);
}

function bcfloor($n)
{
    return bcnegative($n) ? '-' . bcceil(substr($n, 1)) : strtok($n, '.');
}

function bcround($n, $p = 0)
{
    $e = bcpow(10, $p + 1);
    return bcdiv(bcadd(bcmul($n, $e, 0), bcnegative($n) ? -5 : 5), $e, $p);
}

from imagehash.

VincentChalnot avatar VincentChalnot commented on July 17, 2024

I have tested the perceptual hash with more than 5000 images, it occasionally generate weird results, but only with very high distance.
In my experience, anything above 8 can collide, but I still find actually similar images at around 15.
Below 8 it's generally very accurate though.
First collision is at 2 but for pictures of papers (invoices, tickets) so it's pretty normal considering these kind of documents have very few high frequencies variations and are after all, very similar.
First real collision is at 5 for pictures actually looking similar:
image
A very WTF collision at 5:
image
There are some weird collisions around 6-7 but most of the time it's really accurate.

What I don't understand though, is how it can find really similar pictures at very high distances:
image
The most WTF true positives at 15 distances:
image

I have absolutely no explanation for this behaviour, but I will try to merge these PR and test it again:

  • Not always returning a 16 char hash #28
  • Add PerceptualHash2 #58

from imagehash.

garygreen avatar garygreen commented on July 17, 2024

@VincentChalnot thank you so much for detailed information. We are also experiencing the same issue, although we use DifferenceHash. 8 seems to be the sweet spot for us too. How strange, maybe that can be my lottery numbers this weekend 🤣

from imagehash.

jenssegers avatar jenssegers commented on July 17, 2024

There's something wrong with the included perceptual hash algorithm. I could not figure out what exactly. The other included algorithms will usually provide better results.

from imagehash.

garygreen avatar garygreen commented on July 17, 2024

@jenssegers I get similar results even with DifferenceHash algorithm though. So maybe there's something else happening?

from imagehash.

VincentChalnot avatar VincentChalnot commented on July 17, 2024

Check the discussion at #52 and the PR #58 that are resolving a lot of issues. #52 (comment)

Things I will try to look into to detect more duplicates (not really related to pHash)

  • Super easy: Try to use exif data like creation date, photos with identical creation dates are likely to be exact duplicates and pictures with close creation dates can probably be analysed to keep only one.
  • Relatively easy: Try the PR concerning auto-rotation #17 (needs to move to a separate lib IMO)
  • Super hard: Try to center the picture on the main "subject", allowing to discard a percentage of the image around, I know there are library to do that. Compute only the hash of the cropped image.

from imagehash.

VincentChalnot avatar VincentChalnot commented on July 17, 2024

For those interested, here's my final query to select duplicates:

  SELECT m.ulid                                                 AS media_ulid,                                                 
         sm.ulid                                                AS similar_media_ulid,                                         
         BIT_COUNT(m.perceptual_hash ^ sm.perceptual_hash)      AS distance,                                                   
         BIT_COUNT(m.perceptual_hash ^ sm.perceptual_hash) + (                                                                                                          
             322970 - 322980/(1 + POWER(ABS(TIMESTAMPDIFF(MINUTE, m.creation_date, sm.creation_date))/3695643000, 0.55))       
          ) AS adjusted_distance                                                                                               
  FROM media AS m                                                                                                              
           INNER JOIN media AS sm ON m.ulid != sm.ulid                                                                         
           LEFT JOIN similarity s1 on m.ulid = s1.media_ulid                                                                   
  WHERE (m.last_similarity_query_date IS NULL OR sm.update_date > m.last_similarity_query_date)                                
    AND s1.media_ulid IS NULL                                                                                                                                                                                                           
  HAVING distance < 17;

UPDATE media SET last_similarity_query_date = NOW();

It's storing all duplicates into a table in a single query, adding an adjusted distance that adds a modifier on the pHash distance with a curve looking like this:
image
So images took at 20 minutes interval will have a stronger similarity and images took at more than 20min will have less similarities (the rest of the curve is almost linear)

Also, it ignores existing duplicates and it's storing a date on the media that were processed to ignore them during next query (not sure this is useful as the query takes less than 10s for 5000+ pictures).

from imagehash.

dddkkk01 avatar dddkkk01 commented on July 17, 2024

I tried to save the hash (toInt) into mysql, and I found the value were casted which max length is 8 byte(64 bits), so the distance calculated by mysql bit_count is not correct actually. We try to split the hash(toHex) into 2 parts and save them into two columns, add both distances of column as the real distance, sound weird ,but it works! if MySQL could XOR more than 64 bits, we can save the image hash as what we calculate in PHP

from imagehash.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.