This repo contains a custom implementation of the supervised GroundeR [1] architecture for visual grounding. It includes scripts for data processing (Flickr30K Entities [2]), training and evaluation pipelines, and model files.
- [1] Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., & Schiele, B. Grounding of textual phrases in images by reconstruction. European Conference on Computer Vision. Springer, Cham, 2016.
- [2] Plummer, B. A., et al. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of the IEEE International Conference on Computer Vision. 2015.