TensorFlow Similarity is a python package focused on making similarity learning quick and easy.

License: Apache License 2.0

TensorFlow Similarity: Metric Learning for Humans

TensorFlow Similarity is a TensorFlow library for similarity learning, which includes techniques such as self-supervised learning, metric learning, and contrastive learning. TensorFlow Similarity is still in beta and we may push breaking changes.

Introduction

TensorFlow Similarity offers state-of-the-art algorithms for metric learning along with all the components needed to research, train, evaluate, and serve similarity-based and contrastive models. These components include models, losses, metrics, samplers, visualizers, and indexing subsystems.

With TensorFlow Similarity you can train two main types of models:

Self-supervised models: Used to learn general data representations from unlabeled data to boost the accuracy of downstream tasks where you have few labels. For example, you can pre-train a model on a large number of unlabeled images using one of the contrastive methods supported by TensorFlow Similarity, and then fine-tune it on a small labeled dataset to achieve higher accuracy. To get started training your own self-supervised model see this notebook.
Similarity models: Output embeddings that allow you to find and cluster similar examples, such as images representing the same object within a large corpus of examples. For instance, you can train a similarity model to find and cluster similar-looking, unseen cat and dog images from the Oxford IIIT Pet Dataset while only training on a few of the dataset classes. To get started training your own similarity model see this notebook.

What's new

[Mar 2023]: 0.17 More losses and metrics, and a massive refactoring
- Added VICReg Loss to the contrastive losses.
- Added metrics used in retrieval papers, such as Precision@K.
- Native support for distributed training, e.g. SimCLR now works correctly with distributed training.
- Initial support for multi-modal embeddings (CLIP).

For more details and information on previous releases, see the changelog.

Getting Started

Installation

Use pip to install the library.

NOTE: The tensorflow extras_require key can be omitted if you already have tensorflow>=2.4 installed.

pip install --upgrade-strategy=only-if-needed tensorflow_similarity[tensorflow]
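The note above can be turned into a small check. The following is only a sketch using the standard library: the helper names parse_major_minor and pip_requirement are hypothetical, and it assumes TensorFlow is installed under the distribution name tensorflow (variants such as tensorflow-cpu would not be detected).

```python
from importlib import metadata


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '2.11.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def pip_requirement(min_version=(2, 4)):
    """Pick the pip requirement string based on the locally installed TensorFlow.

    Hypothetical helper for illustration; not part of TensorFlow Similarity.
    """
    try:
        installed = parse_major_minor(metadata.version("tensorflow"))
    except metadata.PackageNotFoundError:
        installed = None
    # The extra is only needed when no suitable TensorFlow is already present.
    if installed is not None and installed >= min_version:
        return "tensorflow_similarity"
    return "tensorflow_similarity[tensorflow]"


print(pip_requirement())
```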

Documentation

The detailed and narrated notebooks are a good way to get started with TensorFlow Similarity. There is likely to be one that is similar to your data or your problem (if not, let us know). You can start working with the examples immediately in Google Colab by clicking the Google Colab icon.

For more information about specific functions, check the API documentation.

To contribute to the project, please check out the contribution guidelines.

Minimal Example: MNIST similarity

Here is a bare-bones example demonstrating how to train a TensorFlow Similarity model on the MNIST data. This example illustrates some of the main components provided by TensorFlow Similarity and how they fit together. Please refer to the hello_world notebook for a more detailed introduction.

Preparing data

TensorFlow Similarity provides data samplers for various dataset types that balance the batches to ensure smoother training. In this example, we use the multi-shot sampler that loads data directly from the TensorFlow Datasets catalog.

from tensorflow_similarity.samplers import TFDatasetMultiShotMemorySampler

# Data sampler that generates balanced batches from MNIST dataset
sampler = TFDatasetMultiShotMemorySampler(dataset_name='mnist', classes_per_batch=10)
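To make the idea of a balanced batch concrete, here is a minimal pure-Python sketch. It is not the library's sampler: balanced_batch and its parameters are hypothetical names, and it assumes a balanced batch means the same number of examples drawn from each selected class.

```python
import random
from collections import Counter, defaultdict


def balanced_batch(labels, classes_per_batch, examples_per_class, rng=None):
    """Return example indices for one class-balanced batch.

    Hypothetical helper for illustration; not part of TensorFlow Similarity.
    """
    rng = rng or random.Random()
    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Pick the classes for this batch, then draw the same number of examples from each.
    chosen_classes = rng.sample(sorted(by_class), classes_per_batch)
    batch = []
    for label in chosen_classes:
        # Sample with replacement so classes with few examples still fill their quota.
        batch.extend(rng.choices(by_class[label], k=examples_per_class))
    return batch


labels = [i % 10 for i in range(1000)]  # toy labels standing in for MNIST digits
batch = balanced_batch(labels, classes_per_batch=10, examples_per_class=4, rng=random.Random(0))
print(Counter(labels[i] for i in batch))
```

Every class in the batch contributes the same number of examples, so each batch contains both same-class and different-class pairs for the loss to compare.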

Building a Similarity model

Building a TensorFlow Similarity model is similar to building a standard Keras model, except the output layer is usually a MetricEmbedding() layer that enforces L2 normalization and the model is instantiated as a specialized subclass SimilarityModel() that supports additional functionality.

from tensorflow.keras import layers
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.models import SimilarityModel

# Build a Similarity model using standard Keras layers
inputs = layers.Input(shape=(28, 28, 1))
x = layers.experimental.preprocessing.Rescaling(1/255)(inputs)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = MetricEmbedding(64)(x)

# Build a specialized Similarity model
model = SimilarityModel(inputs, outputs)
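The effect of the L2 normalization enforced by the output layer can be sketched without TensorFlow. The helpers below are hypothetical and assume cosine distance is used to compare embeddings: once vectors have unit length, cosine similarity reduces to a dot product, and the original magnitude of a vector no longer matters.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors: 1 - dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])   # same direction as a, different magnitude
c = l2_normalize([-4.0, 3.0])  # orthogonal to a

print(cosine_distance(a, b))  # close to 0: same direction
print(cosine_distance(a, c))  # close to 1: orthogonal
```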

Training model via contrastive learning

To output metric embeddings that are searchable via approximate nearest neighbor search, the model needs to be trained using a similarity loss. Here we use MultiSimilarityLoss(), which is one of the most efficient loss functions.

from tensorflow_similarity.losses import MultiSimilarityLoss

# Train Similarity model using contrastive loss
model.compile('adam', loss=MultiSimilarityLoss())
model.fit(sampler, epochs=5)

Building images index and querying it

Once the model is trained, reference examples must be indexed via the model index API to be searchable. After indexing, you can use the model lookup API to search the index for the K most similar items.

from tensorflow_similarity.visualization import viz_neigbors_imgs

# Index 100 embedded MNIST examples to make them searchable
sx, sy = sampler.get_slice(0,100)
model.index(x=sx, y=sy, data=sx)

# Find the top 5 most similar indexed MNIST examples for a given example
qx, qy = sampler.get_slice(3713, 1)
nns = model.single_lookup(qx[0])

# Visualize the query example and its top 5 neighbors
viz_neigbors_imgs(qx[0], qy[0], nns)
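Conceptually, indexing stores reference embeddings with their labels, and lookup returns the K closest entries to a query embedding. The sketch below is a hypothetical exact (brute-force) index using cosine distance; it stands in for the approximate nearest neighbor search the library performs and does not reflect its internal implementation.

```python
import math


def _unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


class BruteForceIndex:
    """Hypothetical exact index; stands in for an approximate nearest neighbor index."""

    def __init__(self):
        self._embeddings = []
        self._labels = []

    def add(self, embedding, label):
        """Store a normalized reference embedding with its label."""
        self._embeddings.append(_unit(embedding))
        self._labels.append(label)

    def lookup(self, query, k=5):
        """Return the k closest (distance, label) pairs, nearest first."""
        q = _unit(query)
        scored = []
        for emb, label in zip(self._embeddings, self._labels):
            distance = 1.0 - sum(x * y for x, y in zip(q, emb))
            scored.append((distance, label))
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


index = BruteForceIndex()
index.add([1.0, 0.0], "zero")
index.add([0.9, 0.1], "zero")
index.add([0.0, 1.0], "one")
neighbors = index.lookup([1.0, 0.05], k=2)
print(neighbors)
```

A brute-force scan is exact but costs time linear in the number of indexed items per query, which is why approximate search is preferred for large corpora.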

Supported Algorithms

Self-Supervised Models

SimCLR
SimSiam
Barlow Twins

Supervised Losses

Triplet Loss
PN Loss
Multi Sim Loss
Circle Loss
Soft Nearest Neighbor Loss
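As an illustration of what a supervised similarity loss optimizes, here is the textbook form of the triplet loss for a single (anchor, positive, negative) triplet. This is a sketch under stated assumptions, not the library's implementation: it uses squared Euclidean distance and a fixed margin, and omits the triplet mining that practical implementations rely on.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet loss: max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = squared_euclidean(anchor, positive)
    d_neg = squared_euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]

# The negative is far from the anchor: the margin is satisfied, so the loss is zero.
easy = triplet_loss(anchor, positive, [3.0, 0.0])
# The negative is as close as the positive: the loss equals the margin.
hard = triplet_loss(anchor, positive, [1.0, 0.0])
print(easy, hard)
```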

Metrics

TensorFlow Similarity offers many of the most common metrics used for classification and retrieval evaluation, including:

Name	Type	Description
Precision	Classification
Recall	Classification
F1 Score	Classification
Recall@K	Retrieval
Binary NDCG	Retrieval
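Definitions of retrieval metrics vary between papers, so the following sketch states its assumptions explicitly: recall_at_k counts a query as a hit if any of its top K neighbors shares the query label, and precision_at_k is the fraction of the top K neighbors that share it. Both functions are hypothetical helpers and may differ from the library's own definitions.

```python
def recall_at_k(query_labels, neighbor_labels, k):
    """Fraction of queries with at least one matching label among the top k neighbors."""
    hits = sum(1 for q, nbrs in zip(query_labels, neighbor_labels) if q in nbrs[:k])
    return hits / len(query_labels)


def precision_at_k(query_labels, neighbor_labels, k):
    """Mean fraction of the top k neighbors whose label matches the query label."""
    per_query = [
        sum(1 for n in nbrs[:k] if n == q) / k
        for q, nbrs in zip(query_labels, neighbor_labels)
    ]
    return sum(per_query) / len(per_query)


# Each inner list holds the labels of one query's neighbors, nearest first.
query_labels = ["cat", "dog"]
neighbor_labels = [["cat", "dog", "cat"], ["cat", "cat", "dog"]]

print(recall_at_k(query_labels, neighbor_labels, k=1))
print(recall_at_k(query_labels, neighbor_labels, k=3))
print(precision_at_k(query_labels, neighbor_labels, k=3))
```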

Citing

Please cite this reference if you use any part of TensorFlow Similarity in your research:

@article{EBSIM21,
  title={TensorFlow Similarity: A Usable, High-Performance Metric Learning Library},
  author={Elie Bursztein and James Long and Shun Lin and Owen Vallis and Francois Chollet},
  journal={Fixme},
  year={2021}
}

Disclaimer

This is not an official Google product.

similarity's Issues

update readme

Rewrite the readme with new instructions and information.

Investigate why Indexing class appears slow in multishotsampler

Implement missing metrics

Port some of the scikit-learn metrics, such as:

adjusted rand
v-measure
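For reference, the adjusted Rand index can be computed from pair counts alone. The sketch below is a pure-Python illustration of the standard formula, not the port this issue asks for; scikit-learn exposes the same metric as sklearn.metrics.adjusted_rand_score, and the handling of degenerate inputs here is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumption: trivial inputs count as perfect agreement

    # Contingency counts: items per (true, predicted) cell and per cluster.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_pairs - expected) / (max_index - expected)


# Cluster ids are arbitrary: swapping them does not change the score.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
worse_than_chance = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, worse_than_chance)
```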

Rewrite readme

Readme content is outdated

Fix matcher documentation docstring

Missing return prototype

implement embedding explainability

Port the gradcam++ code

Allow access to the pairwise distances computed during training

Ensure callbacks can access the distances computed during the loss computation. Potentially store them in the loss object by subclassing the loss wrapper?

Add Text (IMDB) dataset

Currently experiments only include image datasets. Adding a text dataset highlights the performance of TF.similarity on text data.

Add pandas back to the setup.py

Make sure to fix the setup.py when all the PRs are done and dusted.

Indexer does not support multihead

With a multi-head model, the indexer crashes because we pass all of the predict outputs to it. The fix is to use output[0] by default and allow the user to specify a different one.

Implement custom train_step()

Make TF similarity a custom model

Add average match rank in the evaluation callback

import the visualization PR from Google Git

implement save() and load()

Add Documentation for TF Serving

Add a README to make it easier for users to get started with TF.similarity serving. The README will include setup instructions, documentation and a demonstration.

No .gitignore

.gitignore is needed to let Git know that it should ignore certain files and not track them

Explain module import error and TF model loading error

The Explain module is not being loaded correctly because its path is incorrect. Loading saved models results in an error because of a missing custom_objects argument.