Comments (31)
I need a vacation 🤣
from image-match.
Ok, I guess this discussion is going a little outside of the scope of that particular issue :)
I can work on filing a PR, which will remove signatures from the index, and will use the ES scores when bringing back the results. I could also impose an optional score cutoff (10, let's say), which could be overridden by a param.
Do you think that's reasonable?
from image-match.
@rhsimplex I filed a PR. I kept dist for the non-elasticsearch implementation. I guess we'd also need to update the docs at some point.
from image-match.
docs.count | store.size
7488880 | 31.2gb
8316821 | 11.9gb
These are the two indexes I have now. The second index is missing signatures. Pretty impressive results.
However, I will keep looking into optimizing the space, it's still too much. With this improvement we're at 1.2TB for 800M images. I'll try to reduce it more.
from image-match.
I'm having a look at the paper. It would be great if I could draw some insights from it. Will get back to you! 👍
from image-match.
One idea: at the moment, signatures are stored as big int arrays, e.g.:
[-1, -2, 0, 0, 1, ..., 2, -1]
And inserted directly into the database. This takes a lot of space on disk. A better way might be to convert them to strings. For instance, if the signature (of length n) has values running from -2..+2, you could make them all non-negative by adding 2:
[1, 0, 2, 2, 3, ..., 4, 1]
You can convert this into a number by multiplying out powers of 5, basically reading it as a base-5 numeral:
1*5^0 + 0*5^1 + 2*5^2 + 2*5^3 + 3*5^4 + ... + 4*5^(n-2) + 1*5^(n-1) = T
By this construction, T is unique. So now you have a very large integer. You can then compress it by re-encoding it in a larger base. Say you take all the digits 0-9 and the letters a..z. That's 10 digits + 26 letters = base 36. How many characters will you need?
The maximum value is 5^n - 1 (if all coefficients are 4; check my math here), so guaranteeing that base 36 can represent any signature requires 36^m - 1 >= 5^n - 1.
Solving for m (again, check my math), m >= log_36(5^n) = n*log_36(5). As you can see, the formula is general (it doesn't matter that we used bases 5 and 36), and m is how long the encoded string will have to be. If n=648, the default, then:
import math
math.log(5**648, 36)
Gives 291.03118615207245, so we need a 292-character string. If you also use uppercase letters (base 62), you can get it down to 253. Still long, but probably less space than a big array of ints (should check this first, though!).
To get it back to an array, you have to do this whole process in reverse.
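As a concrete sketch of the scheme above (the helper names here are mine, not part of image-match; it shifts values to non-negative digits, packs them into one big integer, then re-encodes in base 36):

```python
# Sketch of the compression idea above. encode_signature / decode_signature
# are hypothetical names, not image-match functions.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # base 36

def encode_signature(sig, offset=2, base=36):
    """Pack a small-integer signature into a base-36 string."""
    radix = 2 * offset + 1  # values in -2..+2 become digits 0..4, radix 5
    number = 0
    for i, value in enumerate(sig):
        number += (value + offset) * radix**i
    # Re-encode the big integer in the larger base, least significant first.
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, digit = divmod(number, base)
        chars.append(ALPHABET[digit])
    return "".join(chars)

def decode_signature(s, length, offset=2, base=36):
    """Invert encode_signature back to the original list of ints."""
    radix = 2 * offset + 1
    number = 0
    for i, ch in enumerate(s):
        number += ALPHABET.index(ch) * base**i
    sig = []
    for _ in range(length):
        number, digit = divmod(number, radix)
        sig.append(digit - offset)
    return sig

sig = [-1, -2, 0, 0, 1, 2, -1]
assert decode_signature(encode_signature(sig), len(sig)) == sig
```

Note that the decoder needs the signature length as a parameter, since leading zero digits are otherwise ambiguous.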
from image-match.
Is doing so much computation to conserve a few bytes worth it? (Maybe I'm wrong.) In my opinion, if you are looking at reducing the signature size, that process has to take place before you add the values to the big int array, not after generating the values. In your approach, we are assuming that we have access to a signature array. That shouldn't be the case, right? Because then we'd have the array plus the string in memory. Please correct me if I'm wrong.
Maybe developing a method wherein values are somehow encoded/compressed while they are being generated would help!?
from image-match.
The space savings are not insignificant, especially if you are storing billions of records. For context, I did a little experiment with 100 images and made three different indexes: tester, which is normal; tester_str, where I replaced the signature with a length-253 string; and tester_none, where I removed the signature completely:
http://localhost:9200/tester,tester_str,tester_none/_stats/store
"indices": {
  "tester_str": {
    "primaries": {
      "store": {
        "size_in_bytes": 702295,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "store": {
        "size_in_bytes": 702295,
        "throttle_time_in_millis": 0
      }
    }
  },
  "tester": {
    "primaries": {
      "store": {
        "size_in_bytes": 905047,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "store": {
        "size_in_bytes": 905047,
        "throttle_time_in_millis": 0
      }
    }
  },
  "tester_none": {
    "primaries": {
      "store": {
        "size_in_bytes": 647291,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "store": {
        "size_in_bytes": 647291,
        "throttle_time_in_millis": 0
      }
    }
  }
}
from image-match.
Another (complementary) route is to remove the extraneous zeros from the signature: any grid point on an edge with missing neighbors gets a zero for the difference. These are completely useless and waste space... which is my lazy engineering =)
from image-match.
Any plans on addressing this concern? It feels like, with the current approach, an index with a billion images may weigh hundreds of gigabytes.
from image-match.
I have an index with 200k images, which is 710.4mb (Elasticsearch index size). So for only 2M records, I imagine the size will be roughly 7GB. Speaking of a scale of billions, 2B records would be around... uhm... 7TB? That sounds really scary.
Any thoughts?
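For what it's worth, the extrapolation above is just linear scaling from the observed bytes per record; a quick sanity check using the figures from this thread:

```python
# Sanity check of the back-of-envelope figures above: linear extrapolation
# from the observed index size (710.4 MB for 200k images) to larger counts.
observed_bytes = 710.4 * 1024**2
observed_records = 200_000

bytes_per_record = observed_bytes / observed_records  # roughly 3.7 KB per image

def projected_size_tb(records):
    """Projected index size in TB, assuming the size grows linearly."""
    return records * bytes_per_record / 1024**4

# 2B records comes out to roughly 7 TB, matching the estimate above.
assert 6.5 < projected_size_tb(2_000_000_000) < 7.5
```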
from image-match.
Hi @ecdeveloper,
Yes, the signatures are pretty heavy and your ballpark figures are correct. Disk space was not a concern for us when we developed this, so we didn't put much effort into making it more efficient.
I don't have time to work on new image-match features, but if someone wants to work on a PR improving signature compression, I'll happily support them. Probably the easiest way is just to modify the code to not store the signature at all -- and just return images that match on the most words.
from image-match.
Thank you for your reply. I'd totally be happy to contribute to this amazing project.
I already started looking into the source code, and I was curious: is the signature stored in the db used at all? From what I saw, you currently match images by words, so I wonder what the purpose of storing those signatures was.
Thanks.
from image-match.
That's exactly right. With the Elasticsearch implementation, the actual image distances are only used to sort the final results. This could easily be done with the Elasticsearch match score. So it's really not necessary to store the signatures.
Want to give it a shot?
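A rough sketch of what a score-only query could look like (the field names follow the simple_word_* convention used by image-match, but this is an illustration of the idea, not the library's actual query builder):

```python
# Sketch: search on signature words only and let Elasticsearch's relevance
# score (_score) order the results, so no stored signature or distance
# computation is needed. build_word_query is a hypothetical helper.
def build_word_query(words, size=10):
    """Build an ES body matching any of the signature words."""
    should = [
        {"term": {"simple_word_%d" % i: word}}
        for i, word in enumerate(words)
    ]
    return {
        "query": {"bool": {"should": should}},
        # The signature field can be excluded (or never stored at all).
        "_source": {"excludes": ["signature"]},
        "size": size,
    }

body = build_word_query([101, 202, 303])
assert len(body["query"]["bool"]["should"]) == 3
```

Each matching word contributes to _score, so documents sharing more words with the query image rank higher.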
from image-match.
Yes, for sure. I suppose I'll need to tweak this query object, right?
https://github.com/ascribe/image-match/blob/master/image_match/elasticsearch_driver.py#L57
from image-match.
@rhsimplex I created a quick prototype locally, which doesn't use signatures and distances, and returns only those results where the score is greater than 0. But there is one problem with that approach. Currently, you can pass distance_cutoff as an argument to SignatureES(). If we rely on the score instead of the distance, distance_cutoff won't have any effect.
Do you have any thoughts on how to deal with that issue?
from image-match.
You shouldn't need to alter the database query. SignatureES overrides the __init__ method. You can have it explicitly take a dist argument. Make the default None. Then, pass the base constructor a dummy value (or nothing; it will just take the default) and make sure to conditionally change every line in SignatureES that uses self.distance_cutoff, i.e. if dist is None, skip this statement and sort by score instead of dist here. You might need another parameter like score_cutoff in the constructor.
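A minimal sketch of that constructor change (the stand-in base class and the sort_key helper are mine, just to make the snippet self-contained; the real base is image-match's SignatureDatabaseBase):

```python
# Illustration of the suggested API change, with a stand-in base class so
# the snippet runs on its own.
class SignatureDatabaseBase:
    def __init__(self, distance_cutoff=0.45):
        self.distance_cutoff = distance_cutoff

class SignatureES(SignatureDatabaseBase):
    def __init__(self, dist=None, score_cutoff=0.5, **kwargs):
        # dist=None signals "skip distance computation, sort by ES score".
        self.dist = dist
        self.score_cutoff = score_cutoff
        # Pass a dummy distance to the base constructor when dist is unset.
        super().__init__(distance_cutoff=dist if dist is not None else 0.45)

    def sort_key(self, result):
        # Sort by distance when available, otherwise by (negated) score,
        # since higher scores are better but sorted() is ascending.
        if self.dist is not None:
            return result["dist"]
        return -result["score"]

es = SignatureES()
hits = [{"score": 10.0}, {"score": 15.0}, {"score": 12.0}]
ranked = sorted(hits, key=es.sort_key)
assert ranked[0]["score"] == 15.0
```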
from image-match.
Makes sense.
Running more tests, though, makes me feel that relying on the score isn't a great idea.
I indexed one image, then searched for a bunch of similar images, and here are the results I got (showing the score and dist):
[{ 'score': 1.0, 'dist': 0.73803269432014051, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
--
[{'score': 10.0, 'dist': 0.3448189311714156, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
--
[{'score': 15.0, 'dist': 0.38898318088094741, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
--
[{'score': 12.0, 'dist': 0.47407832345730783, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
According to the docs, higher scores are better. However, here we see that in the 2nd case the score is 10 and dist is 0.34, whereas in the 4th case the score is 12 (higher than the 2nd) and dist is 0.47 (which is no match).
from image-match.
Actually this looks good to me. As long as we choose the default threshold to still get most of the positive results (high recall), it's OK to have worse specificity, especially since the tradeoff brings such huge disk space savings. It was unlikely distance would exactly track score anyway.
The 0.45 default was chosen rather arbitrarily, based on the kinds of images I happened to develop this project for. You'll see that the original paper suggests different thresholds.
In fact, in the reference implementation they often don't compute the distance at all, which lends some credibility to our approach. If you are willing to look into their code, there may be some more tricks too.
from image-match.
Ah, you are right. After comparing those images by looking at them, the one with a distance of 0.47 and a score of 12.0 is very similar to the indexed one. So in this case the score shows a better match than dist does. Do you have any suggestions on the default (recommended) threshold for the score?
from image-match.
Something between 1 and 10? 🤣 Seriously though, just pick something in that range. If you really want to fix narrower bounds on the value, you can look at the tests and see which images should and shouldn't match. By the way, Elasticsearch uses TF/IDF scoring, if you're curious what the score actually means.
from image-match.
Now I'm slightly confused 😕 I ran lots of tests with different images. I noticed that I always get Elasticsearch scores ranging from 1 to 63. 63 usually means a 1-to-1 match (normally when I search for the exact image I indexed), and 1.0 means no match. After running all those tests, I figured that a score of 20.0 to 60.0 means an almost identical image match; 10.0 to 20.0, very similar; and anything under 9.0 is either somewhat of a match or no match at all. Might this be related to a different Elasticsearch version? I'm using Elasticsearch 5.3.0 with Lucene 6.4.1.
from image-match.
There are 63 words available for search. The TF part of the top score means all those words are an exact match, and the IDF part means that every word is unique in the corpus. I believe 63 is the highest possible score then, and as you add more images and repeat the search, the number will likely go down.
So you are justified in being confused, because my explanation was bad! So how about this: ignore the cutoff completely, and just return the top size results?
from image-match.
I could definitely do that, but I see one issue with that approach. Let's say there is no match in the index for the image I'm searching by, but there are slightly similar images. Then the top result would have a score of, let's say, 7. In my case this is a bit of a match, but in fact the images are completely different.
from image-match.
That's OK. It's a typical feature of "search engine"-like applications: some of the results are basically meaningless. But if you were doing a search, wouldn't you rather have a look at the marginal results as well?
Having a hard cutoff like we do now also has disadvantages, as you're seeing in #64
from image-match.
Yes go for it!
from image-match.
Even better than expected! You can try reducing the number of words (N) and see how it affects your results.
from image-match.
Could that affect the accuracy of the results?
I also tried renaming simple_word_* to w_*; this reduced the index size, but very insignificantly, so I reverted it.
from image-match.
Yes, it should strictly decrease accuracy, but I don't know by how much. The original paper used 100 words of length 10, i.e. more, shorter words. I don't find their probabilistic argument that convincing, so it might be worth a shot.
Another thing you can try is getting rid of the extraneous zeros. Signatures are generated by differences of nearest neighbors on a grid. All off-grid differences are just set to zero.
This is done here:
https://github.com/ascribe/image-match/blob/master/image_match/goldberg.py#L418
This function isn't terribly clear since it's written to be fast, but if you look through it you'll see that some parts of the returned array will always be zero. These always get baked into the signature but don't actually carry any information. If you're interested in trying this, I can provide support, but it will be more difficult since the code is really tightly coupled here.
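One generic way to prototype the idea without touching goldberg.py's internals (this is a sketch with hypothetical helper names, masking positions that are zero in every stored signature rather than deriving them from the grid geometry):

```python
import numpy as np

# Sketch: positions that are zero across all signatures (e.g. off-grid
# neighbor differences) carry no information and can be stripped before
# storage, then restored on read.
def find_constant_zero_mask(signatures):
    """Boolean mask of positions that are zero in every signature."""
    sigs = np.asarray(signatures)
    return (sigs == 0).all(axis=0)

def strip_zeros(signature, mask):
    """Drop the always-zero positions from one signature."""
    return np.asarray(signature)[~mask]

def restore_zeros(stripped, mask):
    """Rebuild the full-length signature, refilling the zeros."""
    full = np.zeros(mask.shape, dtype=stripped.dtype)
    full[~mask] = stripped
    return full

sigs = [[0, 1, -2, 0, 2], [0, -1, 0, 0, 1]]
mask = find_constant_zero_mask(sigs)  # positions 0 and 3 are always zero
short = strip_zeros(sigs[0], mask)    # length 3 instead of 5
assert restore_zeros(short, mask).tolist() == sigs[0]
```

In the real code the always-zero positions are fixed by the grid construction, so the mask could be computed once from the grid shape instead of from the data.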
from image-match.
Oh, we're about to remove signatures from the index, remember? ;)
from image-match.
Hi there, I was reading through your discussion here and would like to bring it back to life... We also want to store a large number of signatures in Elasticsearch and are really interested in @ecdeveloper's solution to save storage space and make use of the Elastic scoring feature. Any chance we get that merged?
We couldn't find any other solution than your project for finding similar pictures... which is pretty sad :D
from image-match.