You can run the pipeline on google colab using the following link
All what's needed to start using the pipeline locally is to have python 3.9+ then run the following command
pip install -r ./requirements.txt
A standalone tool for processing dataset of tagged images and calculating its metadata and the CLIP embeddings.
Process a directory of images (paths to directory of images or an archived dataset) and computes the images metadata along with its CLIP embeddings and writes the result into a JSON file in specified output folder.
- Process a tagged dataset (in this example is in
./dataset
folder) and save the output into./output
folder. In addition, the SQLite database nameddataset_cache.sqlite
with table nameddataset_cache
containing file name, hash and file path for dataset images will be created in the./output
folder.
python ./stage1/ImageDatasetProcessor.py --input_folder=./dataset
- Process a non-tagged dataset (in this example is in
./dataset
folder) and save the output into./output/clip-cache
folder. In addition, the SQLite database nameddataset_cache.sqlite
with table nameddataset_cache
containing file name, hash and file path for dataset images will be created in the./output/clip-cache
folder.
python ./stage1/ImageDatasetProcessor.py --input_folder=./dataset --tagged_dataset=False
input_folder
[string] - [required] - path to the directory containing sub-folders of each tag.output_folder
[string] - [optional] - path to the directory where to save the files into it.tagged_dataset
[bool] - [optional] - the dataset to process is a tagged dataset such that each each parent folder name is the tag of the images contained within it, default isTrue
clip_model
[str] - [optional] CLIP model to be used, default is'ViT-B-32'
pretrained
[str] - [optional] - the pre-trained model to be used for CLIP, default is'openai'
batch_size
[int] - [optional] - number of images to process at a time, default is32
.num_threads
[int] - [optional] - the number to be used in this process, default is4
device
[str] - [optional] - the device to be used in computing the CLIP embeddings, ifNone
is provided thencuda
will be used if available, default isNone
A script for training classification models based on
input-tag-to-image-hash-list.json
file andinput-metadata.json
file.
Given a metadata
json file containing embeddings for images and tag-to-image-hash
json file containing images' hash with tags, the script start to make for every tag two binary classification models and save it in output folder.
python ./stage2/train.py --metadata_json=./output/input-metadata.json --tag_to_hash_json=./output/input-tag-to-image-hash-list.json
If the
output
folder is not present, the script automatically creates it for you. Also you may call--help
to see the options and their defaults in the cli.
metadata_json
[string] - [required] - The path to the metadata json file.tag_to_hash_json
[string] - [required] - The path to tag-to-hash json file.output
[string] - [optional] - The path to the output directory.test_per
[float] - [optional] - The percentage of the test images from the dataset, default = 0.1. If thetest_per
is not created the script will make test set ~= 10% of the dataset.
A script for classification models inference given images'
directory
andmetadata_json
.json file. This stage does not process .zip (archived) file.
Given a metadata_json
json file containing embeddings for images and directory
of images' folder, the script start to loop over every image and make the classification for it using every binary classification model. If the --output
argument is not specified, the classification / inference result will be placed at ./output/tagging_output
folder. Time stamp will be appended to folder name (for example: ./output/tagging_output_2023_1_21_0_56
).
In addition, the SQLite database named score_cache.sqlite
with table named score_cache
containing file name, file path, file hash, model name, model type, model train date, tag string and tag score for given images will be created in the ./output
folder.
python ./stage3/classify.py --directory=./path/to/images/dir --metadata_json=./output/input-metadata.json --model_type=ovr-logistic-regression --tag=pos-character
Or
python ./stage3/classify.py --directory=./path/to/images/dir --metadata_json=./output/input-metadata.json --output=./output --output_bins=10 --model_type=ovr-logistic-regression --tag=pos-character
Note that if the
output
folder is not present, the script automatically creates it for you. Also you may call--help
to see the options and their defaults in the cli.
directory
[string] - [required] - The path to the test images folder.metadata_json
[string] - [required] - The path to the metadata json file for CLIP embeddings.output
[string] - [optional] - The path to the output directory for the inference results.model_type
[string] - [optional] - The type of the model (example:ovr-logistic-regression
,ovr-svm
,torch-logistic-regression
).tag
[string] - [optional] - Tag string (example:pos-character
,pos-environmental-space
, etc).output_bins
[int] - [optional] - The number of bins of the results for each model.
A script for classification models inference given images'
directory
andmetadata_json
.json file. This stage only process .zip (archived) file. It is used if the image data specified indirectory
argument is either in .zip archived format or containing images archived in .zip format.
Given a metadata_json
json file containing embeddings for images and directory
of images' folder, the script start to loop over every image and make the classification for it using every binary classification model. If the --output
argument is not specified, the classification / inference result will be placed at ./output/tagging_output
folder. Time stamp will be appended to folder name (for example: ./output/tagging_output_2023_1_21_0_56
).
In addition, the SQLite database named zip_score_cache.sqlite
with table named zip_score_cache
containing file name, file path, archive path, type of file, hash, model type, tag name and tag score for given images will be created in the output
folder.
python ./stage4/classify_zip.py --directory=./path/to/images/dir --metadata_json=./output/input-metadata.json --model_type=ovr-logistic-regression --tag=pos-character
Or
python ./stage4/classify_zip.py --directory=./path/to/images/dir --metadata_json=./output/input-metadata.json --output=./output --output_bins=10 --model_type=ovr-logistic-regression --tag=pos-character
Note that if the
output
folder is not present, the script automatically creates it for you. Also you may call--help
to see the options and their defaults in the cli.
directory
[string] - [required] - The path to the images folder or images .zip file.metadata_json
[string] - [required] - The path to the metadata json file for CLIP embeddings.output
[string] - [optional] - The path to the output directory for the inference results.model_type
[string] - [optional] - The type of the model (example:ovr-logistic-regression
,ovr-svm
,torch-logistic-regression
)tag
[string] - [optional] - Tag string (example:pos-character
,pos-environmental-space
, etc).output_bins
[int] - [optional] - The number of bins of the results for each model.
A tool to create file cache in the form of SQLite database, add data to and fetch from it. File cache contains attributes for each file in given directory or folder. The attributes for each file are represented by the following fields in file cache SQLite database table:
hash_id
,file_name
,path
,type
,is_archive
,n_content
andcontainer_archive
.
The file cache module defined in cache_file.py
contains the class definition with the following functions:
- class
cache_file
.FileCache
- A class to construct file cache object. create_file_cache
(out_dir = './output'
,db_name = 'file_cache.sqlite'
) - Method to create file cache database with default namefile_cache.sqlite
and default location in project./output
directory. The file cache database will not be created if the database with same name path already exist.add_folder_to_file_cache
(data_dir
,out_dir = './output'
,db_name = 'file_cache.sqlite'
) - Method to add image files contained in a folder / directory specified indata_dir
to file cache database specified indb_name
at locationout_dir
directory.get_random_hash
(_db_path
) - Method to fetch random file hash from file cache database specified indb_path
. This method returns random file hash instr
format.get_img_by_hash
(db_path
,hash_id
) - Method to fetch image file data from file cache database specified indb_path
with specific hashhash_id
. This method returns a dictionary with the following keys and its values:hash_id
- image file hash,file_name
- image file name,path
- image file path,type
- image file type,is_archive
- will beTrue
if the file is ZIP archive,n_content
- for ZIP archive, indicate the number of contained image file,container_archive
- the name of containing ZIP archive if the image file is contained in ZIP archive.get_random_image
(db_path
) - Method to fetch image file data from file cache database specified indb_path
. The return value is a dictionary with the same structure as a return value ofget_img_by_hash()
method above.clear_cache
(db_path
,delete_cache = False
) - Method to clear all data from the table in file cache database specified indb_path
. If thedelete_cache
argument is set toTrue
, the file cache SQLite database file will be removed.
from cache_file import FileCache
# Create file cache object
fileCache = FileCache()
# Create file cache database. Default to './output/file_cache.sqlite')
fileCache.create_file_cache()
# Adding image files contained in a folder or ZIP archive to file cache database
fileCache.add_folder_to_file_cache('./dataset/testdata1.zip')
# Get random file hash_id
hash_id = fileCache.get_random_hash('./output/file_cache.sqlite')
# Fetch image file data from file cache database with specific hash
img_dict = fileCache.get_img_by_hash('./output/file_cache.sqlite', hash_id)
''' Example img_dict:
{'file_name': '10_1.jpg',
'file_path': './dataset/testdata1.zip/testdata1/10_1.jpg',
'hash_id': '7bd45969bb6ffc6486fd560e42ab6ed1b788f3bb8541480be893da6ca4fcbff55b7d97e3505dc715fa962a23d24b7325a044206078390c08d63ac03d9fd4f67a',
'file_type': '.jpg',
'is_archive': None,
'n_content': None,
'container_archive': './dataset/testdata1.zip'
}
'''
# Fetch random image file data from file cache database.
img_dict = fileCache.get_random_image('./output/file_cache.sqlite')
# Clear all data from the table in file cache database.
fileCache.clear_cache('./output/file_cache.sqlite', delete_cache=False)
A tool to create, add and fetch CLIP vector data to .sqlite database file. CLIP cache contains CLIP vector data for each image file in given directory or folder. The attributes for each image file are represented by the following fields:
hash_id
,clip_vector
andmodel
.
The CLIP cache module defined in cache_clip.py
contains the class definition with the following functions:
- class
cache_clip
.ClipCache
- A class to construct CLIP cache object. create_clip_cache
(out_dir = './output'
,db_name = 'clip_cache.sqlite'
) - Method to create CLIP cache database with default nameclip_cache.sqlite
and default location in project./output
directory. The CLIP cache database will not be created if the database with same name path already exist.add_folder_to_clip_cache
(data_dir
,out_dir = './output'
,db_name = 'clip_cache.sqlite'
) - Method to add image files contained in a folder / directory specified indata_dir
to CLIP cache database specified indb_name
at locationout_dir
directory.get_random_hash
(_db_path
) - Method to fetch random file hash from CLIP cache database specified indb_path
. This method returns random file hash instr
format.get_clip_by_hash
(db_path
,hash_id
) - Method to fetch image CLIP vector from CLIP cache database specified indb_path
with specific hashhash_id
. This method returns a dictionary with the following keys and its values:hash_id
- image file hash instr
format,clip_vector
- image CLIP vector in numpy array format,model
- model type that is used to generate image CLIP vector instr
format.get_random_clip
(db_path
) - Method to fetch image CLIP vector from CLIP cache database specified indb_path
. The return value is a dictionary with the same structure as a return value ofget_clip_by_hash()
method above.clear_cache
(db_path
,delete_cache = False
) - Method to clear all data from the table in CLIP cache database specified indb_path
. If thedelete_cache
argument is set toTrue
, the CLIP cache SQLite database file will be removed.
from cache_clip import ClipCache
# Create CLIP cache object
clipCache = ClipCache()
# Create CLIP cache database. Default to './output/clip_cache.sqlite')
clipCache.create_clip_cache()
# Adding image files contained in a folder or ZIP archive to CLIP cache database
clipCache.add_folder_to_clip_cache('./dataset/testdata1.zip')
# Get random file hash_id
hash_id = clipCache.get_random_hash('./output/clip_cache.sqlite')
# Fetch CLIP vector data from CLIP cache database with specific hash
clip_dict = clipCache.get_clip_by_hash('./output/clip_cache.sqlite', hash_id)
# Fetch random CLIP vector data from CLIP cache database.
clip_dict = clipCache.get_random_clip('./output/clip_cache.sqlite')
# Getting data
clip_vector = clip_dict['clip_vector']
hash_id = clip_dict['hash_id']
model = clip_dict['model']
# Clear all data from the table in CLIP cache database.
clipCache.clear_cache('./output/clip_cache.sqlite', delete_cache=False)
A tool to create, add and fetch image tag data to .sqlite database file. Tag cache contains tag data for each image file in given directory or folder. The attributes for each image file are represented by the following fields:
hash_id
andtag
.
The tag cache module defined in cache_tag.py
contains the class definition with the following functions:
- class
cache_tag
.TagCache
- A class to construct tag cache object. create_tag_cache
(out_dir = './output'
,db_name = 'tag_cache.sqlite'
) - Method to create tag cache database with default nametag_cache.sqlite
and default location in project./output
directory. The tag cache database will not be created if the database with same name path already exist.add_folder_to_tag_cache
(data_dir
,out_dir = './output'
,db_name = 'tag_cache.sqlite'
) - Method to add image files contained in a folder / directory specified indata_dir
to tag cache database specified indb_name
at locationout_dir
directory.get_random_hash
(_db_path
) - Method to fetch random file hash id from tag cache database specified indb_path
. This method returns random file hash instr
format.get_tag_by_hash
(db_path
,hash_id
) - Method to fetch image tag from tag cache database specified indb_path
with specific hashhash_id
. This method returns image tag instr
format.get_hash_by_tag
(db_path
,tag
) - Method to get list of hash ids from tag cache database specified indb_path
for given tag stringtag
.get_random_tag
(db_path
) - Method to fetch image tag from tag cache database specified indb_path
. This method returns a dictionary with the following keys and its values:hash_id
- image file hash id instr
format,tag
- image tag instr
format. The return value is a dictionary with the same structure as a return value ofget_tag_by_hash()
clear_cache
(db_path
,delete_cache = False
) - Method to clear all data from the table in tag cache database specified indb_path
. If thedelete_cache
argument is set toTrue
, the tag cache SQLite database file will be removed.
from cache_tag import TagCache
# Create tag cache object
tagCache = TagCache()
# Create tag cache database. Default to './output/tag_cache.sqlite')
tagCache.create_tag_cache()
# Adding image files contained in a folder or ZIP archive to tag cache database
tagCache.add_folder_to_tag_cache('./dataset/testdata1.zip')
# Get random file hash_id
hash_id = tagCache.get_random_hash('./output/tag_cache.sqlite')
# Fetch tag string from tag cache database with specific hash
tag_str = tagCache.get_tag_by_hash('./output/tag_cache.sqlite', hash_id)
# Fetch list of hash ids for specific tag string
hash_id_list = tagCache.get_hash_by_tag('./output/tag_cache.sqlite', tag_str)
# Fetch random hash_id and tag pair from tag cache database.
tag_dict = tagCache.get_random_tag('./output/tag_cache.sqlite')
hash_id = tag_dict['hash_id']
tag = tag_dict['tag']
print (f'hash: {hash_id}, tag: {tag}')
'''Example output
{'hash_id': '5fed6cf0ff1028f58f6cc73bb09e142eb4aacf1f5a206327fda3ef3db8cfcbf3e643ec1126238115d0fcfbe35be6e3204b21e06d1c3a7f0229e6aa18696ee5da', 'tag': 'other-validation'}
'''
# Clear all data from the table in tag cache database.
tagCache.clear_cache('./output/tag_cache.sqlite', delete_cache=False)
A web API for getting random image file and show its relevant data as follows: file name, file path, image size, container archive, hash ID, request time and the image itself.
The web API module defined in file_cache_web_api.py
runs FLASK-based server and contains functions to fetch random file from file cache database and returns its respective data (file name, file path, image size, container archive, hash ID, request time and the image itself) based on HTTP request made from web browser.
Start the web API module form CLI. In default, the server runs on host 0.0.0.0
and port 8080
.
python file_cache_web_api.py
or start the web API in other host and port using host
and port
CLI arguments as the follows.
python file_cache_web_api.py --host=0.0.0.0 --port=8000
Fetch random image from file cache database (created using File Cache Module in file_cache.py
) specified in db_path
. The following URL request (made from web browser) will return HTML page containing file name, file path, image size, container archive, hash ID, request time and the image itself. Specify the db_path
as argument with query string using '?' and its value after =
.
http://127.0.0.1:8080/get_random_img?db_path=./output/file_cache.sqlite
An API to list, access and use existing classifier models. Model API contains function that accesses existing classifier model pickle files (in given path) and returns existing classifier model as Python dictionary.
The model API defined in api_model.py
contains the class definition with the following functions:
- class
api_model
.ModelApi
- A class to construct the model loader object. get_models_dict
(models_path
) - Method that returns models dictionary for model pickle file in givenmodels_path
.
from api_model import ModelApi
# Create model loader object
model_api = ModelApi()
# Get models dictionary to models_dict for model pickle files in given models_path.
models_dict = model_api.get_models_dict(models_path='./output/models')
'''
Example stucture of models_dict
{<model_name>:
{'classifier' : <model object>,
'model_type' : <model type string>,
'train_start_time' : <training start time datetime object>
'tag' : <tag string>
}
}
'''
Web API to list existing classifier models (name, type, training start time and tag string) in JSON format.
The model web API defined in webapi_model.py
runs FLASK-based server and contains functions to list existing classifier models (name, type, training start time and tag string) in JSON format based on HTTP request made from web browser.
Start the model web API form CLI. In default, the server runs on host 0.0.0.0
and port 8080
.
python webapi_model.py
or start the web API in other host and port using host
and port
CLI arguments as the follows.
python webapi_model.py --host=0.0.0.0 --port=8000
List existing classifier models (name, type, training start time and tag string)
http://127.0.0.1:8080/get_models
A tool for:
- Get list of image files for specific model and score range from classification result / score cache created in classification Stage 3 and Stage 4 (
score_cache.sqlite
orzip_score_cache.sqlite
)- Clearing classification result / score cache created in classification Stage 3 and Stage 4 (
score_cache.sqlite
orzip_score_cache.sqlite
) from entry with model training date older than the training date of respective current models in specified models directory.
The model cache defined in model_cache.py
contains the class definition with the following functions:
- class
model_cache
.ModelCache
- A class to construct the model cache object. get_img_from_score_cache
(models_name
,score_gte = 0.0
,score_lte = 1.0
,db_path = './output/score_cache.sqlite'
,db_table_name = 'score_cache'
) - Returns list of file names for specificmodel_name
and score betweenscore_gte
andscore_lte
from classification cache indb_path
with table namedb_table_name
.clear_score_cache_by_model_date
(models_path = './output/models'
,score_cache_path = './output/score_cache.sqlite'
,score_cache_table_name = 'score_cache'
) - Clearing classification result / score cache inscore_cache_path
from entry with model training date older than the training date of respective current models (inmodels_path
).
Get list of image files for specific model and score range
from model_cache import ModelCache
# Create model cache object
model_cache = ModelCache()
# Get files dictionary
files_dict = model_cache.get_img_from_score_cache(model_name='model-ovr-logistic-regression-tag-pos-character', score_gte=0.9, score_lte=1.0)
Clearing score cache based on model's training date. The following example will eliminate entry in score cache score_cache.sqlite
if the model's training date is older than the training date of respsctive model in ./output/models
directory.
from model_cache import ModelCache
# Create model cache object
model_cache = ModelCache()
# Clearing score cache for entries with outdated model
is_success, deleted_entries = model_cache.clear_score_cache_by_model_date()
# is_success will be True when success or False if there is an error
# deleted_entries will contain list of deleted entry dictionary.
# The method will print the original number of entries, number of deleted entriee and current number of entries.
Get list of image files for specific model and score range (Web API version)
Start the model cache web API form CLI. In default, the server runs on host 0.0.0.0
and port 8080
.
python webapi_model_cache.py
or start the web API in other host and port using host
and port
CLI arguments as the follows.
python webapi_model_cache.py --host=0.0.0.0 --port=8000
List existing classifier models (name, type, training start time and tag string)
http://127.0.0.1:8080/get_images_in_tag_score_range?model_name=model-ovr-logistic-regression-tag-not-pixel-art&score_gte=0.9&score_lte=1.0
Example script of using CLIP vectors to find Similar images in set of images
Run the Quality Model App on google colab using the following link
An IPython-based app for quality model.
Run the Quality Model App on google colab using the following link
Different ways for downloading data from Mega.
Run the Quality Model App on google colab using the following link