Generate synthetic captions for images, audio, and video to build better datasets.
Benchmark different captioner models to select the best ones: benchmark.sh
You can estimate how many resources (GPU memory, number of GPU-hours) you would need to caption images with a given set of models, and how good those captioners are (e.g., via CLIPScore). See benchmark_config.json for details.
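As background on the quality metric mentioned above: CLIPScore is essentially a scaled, clipped cosine similarity between a CLIP image embedding and a CLIP text embedding of the caption. The sketch below is a minimal illustration of that formula over precomputed embeddings; the function name and the w=2.5 scaling follow the original CLIPScore paper and are not taken from this repository's code.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 2.5) -> float:
    """Illustrative CLIPScore: w * max(cos(image, caption), 0).

    Assumes `image_emb` and `text_emb` are CLIP embeddings of the image
    and its candidate caption (normalization is handled here).
    """
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    # Cosine similarity, clipped at 0 so unrelated captions score 0, then scaled.
    return w * max(float(i @ t), 0.0)
```

In practice the embeddings would come from a CLIP model applied to each image and its generated caption; higher scores indicate captions that CLIP judges as better matched to the image.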