Coder Social home page Coder Social logo

image_search_engine's Introduction

Building a Deep Image Search Engine using tf.Keras

Motivation :

Imagine having a data collection of hundreds of thousands to millions of images without any metadata describing the content of each image. How can we build a system that is able to find a sub-set of those images that best answer a user’s search query ?
What we will basically need is a search engine that is able to rank image results given how well they correspond to the search query, which can be either expressed in a natural language or by another query image.
The way we will solve the problem in this post is by training a deep neural model that learns a fixed length representation (or embedding) of any input image and text and makes it so those representations are close in the euclidean space if the pairs text-image or image-image are “similar”.

Data set :

I could not find a data-set of search result ranking that is big enough but I was able to get this data-set : http://jmcauley.ucsd.edu/data/amazon/ which links E-commerce item images to their title and description. We will use this metadata as the supervision source to learn meaningful joined text-image representations. The experiments were limited to fashion (Clothing, Shoes and Jewelry) items and to 500,000 images in order to manage the computations and storage costs.

Problem setting :

The data-set we have links each image with a description written in natural language. So we define a task in which we want to learn a joined, fixed length representation for images and text so that each image representation is close to the representation of its description.

Model :

The model takes 3 inputs : The image (which is the anchor), the image title+description ( the positive example) and the third input is some randomly sampled text (the negative example).
Then we define two sub-models :

  • Image encoder : Resnet50 pre-trained on ImageNet+GlobalMaxpooling2D
  • Text encoder : GRU+GlobalMaxpooling1D

The image sub-model produces the embedding for the Anchor **E_a **and the text sub-model outputs the embedding for the positive title+description E_p and the embedding for the negative text E_n.

We then train by optimizing the following triplet loss:

L = max( d(E_a, E_p)-d(E_a, E_n)+alpha, 0)

Where d is the euclidean distance and alpha is a hyper parameter equal to 0.4 in this experiment.

Basically what this loss allows to do is to make **d(E_a, E_p) small and make d(E_a, E_n) **large, so that each image embedding is close to the embedding of its description and far from the embedding of random text.

Visualization Results :

Once we learned the image embedding model and text embedding model we can visualize them by projecting them into two dimensions using tsne (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html ).

Test Images and their corresponding text description are linked by green lines

We can see from the plot that generally, in the embedding space, images and their corresponding descriptions are close. Which is what we would expect given the training loss that was used.

Text-image Search :

Here we use few examples of text queries to search for the best matches in a set of 70,000 images. We compute the text embedding for the query and then the embedding for each image in the collection. We finally select the top 9 images which are the closest to the query in the embedding space.

These examples show that the embedding models are able to learn useful representations of images and embeddings of simple composition of words.

Image-Image Search :

Here we will use an image as a query and then search in the database of 70,000 images for the examples that are most similar to it. The ranking is determined by how close each pair of images are in the embedding space using the euclidean distance.

The results illustrate that the embeddings generated are high level representations of images that capture the most important characteristics of the objects represented without being excessively influenced by the orientation, lighting or minor local details, without being trained explicitly to do so.

Conclusion :

In this project we worked on the Machine learning blocks that allow us to build a keyword and image based search engine applied to a collection of images. The basic idea is to learn a meaningful and joined embedding function for text and image and then use the distance between items in the embedding space to rank search results.

References :

image_search_engine's People

Contributors

cvxtz avatar dependabot[bot] avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

image_search_engine's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.