jeozhao / embedding-dataset-reordering

This project forked from veldrovive/embedding-dataset-reordering

Reorders the embeddings generated by CLIP to be in line with the webdatasets generated by img2dataset.

License: MIT License

Python 95.51% Makefile 4.49%

embedding-dataset-reordering's Introduction

Clip Embedding Reordering

Note: This currently relies on the --output_format="webdataset" option from img2dataset. If your images are not inside .tar files, this will not work correctly.

CLIP embeddings generated by clip-retrieval are not stored in the same order as the webdataset they were generated from. This tool reorders large CLIP embedding datasets so that they match the order of the image dataset they were generated from.

Install

git clone https://github.com/Veldrovive/embedding-dataset-reordering

cd embedding-dataset-reordering

pip install -e .

API

This module exposes three functions. Example commands are meant to be evaluated from inside the examples folder.

For example, to download the test dataset with img2dataset, navigate to the root directory and run cd examples && reorder-embeddings download-data.

To generate embeddings with clip-retrieval for this test data, run reorder-embeddings clip-inference from the examples folder.


reorder: Takes as input an unordered embedding dataset along with metadata generated by clip-retrieval and reorders the embeddings to match the order of the image dataset.

Note: Before starting, you need to find the shard string width and index string width of your dataset. This is a manual task, but it is easy to find. Navigate to the metadata directory of your embedding dataset and run reorder-embeddings example_key.

This will print something similar to:

Example Keys:
Shard 3 has keys ['0000309', '0000321']
Shard 2 has keys ['0000209', '0000237']
Shard 0 has keys ['0000022', '0000031']
Shard 1 has keys ['0000114', '0000123']

By inspection, we can see that the first 5 characters represent the index of the shard (i.e. the keys for shard 3 start with 00003), so the remaining 2 digits represent the index within the shard, which means the index width is 2.
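
The same inspection can be written as a few lines of Python. This is a minimal sketch, not part of the tool: split_key and infer_index_width are hypothetical helper names, and the code assumes every key is the zero-padded shard number followed by the zero-padded index within that shard.

```python
def split_key(key: str, index_width: int) -> tuple[int, int]:
    """Split a key into (shard, index), treating the last index_width characters as the index."""
    return int(key[:-index_width]), int(key[-index_width:])


def infer_index_width(example_keys: dict[int, list[str]]) -> int:
    """Return the smallest index width for which every key's prefix equals its shard number."""
    key_length = len(next(iter(example_keys.values()))[0])
    for index_width in range(1, key_length):
        if all(
            split_key(key, index_width)[0] == shard
            for shard, keys in example_keys.items()
            for key in keys
        ):
            return index_width
    raise ValueError("no index width is consistent with these keys")


# The example keys printed above, keyed by shard number.
example_keys = {
    3: ["0000309", "0000321"],
    2: ["0000209", "0000237"],
    0: ["0000022", "0000031"],
    1: ["0000114", "0000123"],
}
print(infer_index_width(example_keys))
```

The shard string width is then the key length minus the inferred index width.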

Parameters

  • embeddings_folder: Path to the folder containing the embedding .npy files.
  • metadata_folder: Path to the folder containing the .parquet metadata files.
  • output_folder: Path to the folder where the reordered .npy files will be saved.
  • index_width: The index width found above.
  • output_shard_width: The width of the shard string for the output files. Should be the same as the shard width for the webdataset.
  • limit: The number of shards to reorder.
  • run-concurrent: The number of workers to use during reordering.
  • verbose: Whether to print out expanded logging.
  • tmp-folder: With many workers, the temporary file directories get very large. If this is a problem, reduce the number of workers or set tmp-folder to a location with more space available.
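
To illustrate what reordering means here, the sketch below sorts embedding row positions by the shard and index encoded in each metadata key. It is only an illustration of the idea under stated assumptions, not the tool's implementation: rows_per_shard is a hypothetical helper, the keys are made up, and the key layout (zero-padded shard followed by zero-padded index) is assumed.

```python
from itertools import groupby


def split_key(key: str, index_width: int) -> tuple[int, int]:
    """Split a key into (shard, index), treating the last index_width characters as the index."""
    return int(key[:-index_width]), int(key[-index_width:])


def rows_per_shard(
    keys: list[str], index_width: int, output_shard_width: int
) -> dict[str, list[int]]:
    """Map each zero-padded shard string to its embedding row positions, sorted by index."""
    # Sort row positions by (shard, index) so they follow the image dataset's order.
    order = sorted(range(len(keys)), key=lambda row: split_key(keys[row], index_width))
    # Consecutive rows in the sorted order that share a shard number form one output shard.
    by_shard = groupby(order, key=lambda row: split_key(keys[row], index_width)[0])
    return {str(shard).zfill(output_shard_width): list(rows) for shard, rows in by_shard}


# Made-up metadata keys, listed in the order the embeddings were written.
keys = ["0000102", "0000001", "0000100", "0000000"]
print(rows_per_shard(keys, index_width=2, output_shard_width=5))
# prints {'00000': [3, 1], '00001': [2, 0]}
```

With the embeddings loaded as a NumPy array, indexing the array with one of these row lists would yield that shard's embeddings in dataset order.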

download-data: Uses img2dataset to download a test dataset. Run this from the examples directory to download the default one.

clip-inference: Uses clip-retrieval to generate embeddings for the test dataset. Run this from the examples directory after downloading the test dataset.

For development

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests, first install the test requirements:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.

embedding-dataset-reordering's People

Contributors

veldrovive
