This is the course project for Pattern Recognition and Machine Learning at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology.

TransT presents an attention-based network that fuses template and search features to achieve precise and robust tracking. Inspired by TransT, we propose a pseudo-Siamese network that is independent at the lower levels and shared at the higher levels, matching the characteristics of heterogeneous image-matching tasks. In our experiments, we compare different backbones as well as different feature-extraction strategies for the template and search images. We also simplify the attention module in TransT according to the characteristics of image matching.
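The low-sep-high-sharing idea can be sketched structurally as follows. This is an illustrative toy, not the actual model: the real backbone uses convolutional stages, and the scalar "weights" below are hypothetical stand-ins. Each modality (e.g., the visible template and the infrared search image) gets its own low-level stage, while a single high-level stage with shared parameters processes both branches.

```python
# Toy sketch of the low-sep-high-sharing pseudo-Siamese structure.
# The scalar weights stand in for conv layers; only the parameter-sharing
# pattern is meant to be accurate, not the computation itself.

class ToyBackbone:
    def __init__(self):
        # Independent low-level parameters, one set per modality.
        self.low_visible = 2.0
        self.low_infrared = 3.0
        # A single shared high-level parameter used by both branches.
        self.high_shared = 0.5

    def low_stage(self, x, modality):
        # Modality-specific low-level feature extraction.
        w = self.low_visible if modality == "visible" else self.low_infrared
        return [w * v for v in x]

    def high_stage(self, feat):
        # Shared high-level feature extraction (same weights for both inputs).
        return [self.high_shared * v for v in feat]

    def forward_pair(self, template, search):
        # Separate low-level branches, then one shared high-level stage.
        t = self.low_stage(template, "visible")
        s = self.low_stage(search, "infrared")
        return self.high_stage(t), self.high_stage(s)
```

A fully independent double backbone would instead duplicate the high-level stage per modality (more parameters, as in the tables below), while a single backbone would share the low-level stage as well.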
To train a model, run

```shell
python run_train.py
```

To compute the test metrics, run

```shell
python run_test.py
```

To see a demo, run

```shell
python demo.py
```
### Dataset: M3FD

M3FD is a dataset of paired visible and infrared images containing 6 kinds of targets: {People, Car, Bus, Motorcycle, Lamp, Truck}.
The following are the results of applying different backbone processing strategies to TransT on the M3FD test set.
Backbone processing strategy | single backbone | low-sep-high-sharing double backbones | independent double backbones |
---|---|---|---|
Model parameters | 23.0M | 23.2M | 31.6M |
FLOPs | 25.49G | 25.49G | 25.49G |
mIOU | 0.71 | 0.80 | 0.80 |
P0.5 | 86.26% | 95.10% | 92.91% |
P0.7 | 70.35% | 88.79% | 86.95% |
(P0.5 and P0.7 denote the fraction of test samples with IoU above 0.5 and 0.7, respectively.)
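The table's metrics can be computed from the per-sample IoU values. The helper names below are hypothetical (not from this repository); they simply make the definitions of mIOU and P@threshold explicit.

```python
# mIOU: mean IoU over all test samples.
# P@t: fraction of samples whose IoU exceeds threshold t (0.5 and 0.7 above).

def miou(ious):
    """Mean IoU over a list of per-sample IoU values."""
    return sum(ious) / len(ious)

def p_at(ious, t):
    """Fraction of samples with IoU above threshold t."""
    return sum(1 for iou in ious if iou > t) / len(ious)
```

For example, for per-sample IoUs `[0.9, 0.6, 0.4, 0.8]`, `miou` gives 0.675 and `p_at(..., 0.5)` gives 0.75.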
### Different Backbones

The following are the results of applying the low-sep-high-sharing backbone processing strategy to models whose backbones differ from TransT's, on the M3FD test set.
Backbone Network | ResNet50 | MobileNetv3 | CSPNet |
---|---|---|---|
Model parameters | 23.2M | 17.4M | 26.7M |
FLOPs | 25.49G | 15.34G | 29.42G |
mIOU | 0.80 | 0.81 | 0.77 |
P0.5 | 95.10% | 96.66% | 91.78% |
P0.7 | 88.79% | 88.63% | 84.09% |
Applying multi-scale feature-map fusion to the above backbones yields the following results.
Backbone Network | ResNet50(multi) | MobileNetv3(multi) | CSPNet(multi) |
---|---|---|---|
Model parameters | 24.3M | 17.9M | 27.6M |
FLOPs | 26.16G | 15.64G | 29.98G |
mIOU | 0.80 | 0.81 | 0.81 |
P0.5 | 94.75% | 96.33% | 93.91% |
P0.7 | 86.05% | 88.43% | 88.84% |
We also conduct experiments on homologous image matching using the visible-image part of M3FD. Note that the model is then a standard Siamese network.
Backbone Network | ResNet50 | MobileNetv3(multi) | CSPNet(multi) |
---|---|---|---|
Model parameters | 23.0M | 17.7M | 27.0M |
FLOPs | 25.49G | 15.64G | 29.98G |
mIOU | 0.92 | 0.89 | 0.91 |
P0.5 | 99.25% | 99.20% | 99.61% |
P0.7 | 96.61% | 97.75% | 95.86% |
We apply the above models to the COCO dataset without fine-tuning.
Backbone Network | ResNet50 | MobileNetv3(multi) | CSPNet(multi) |
---|---|---|---|
mIOU | 0.85 | 0.81 | 0.84 |
P0.5 | 95.36% | 94.33% | 94.81% |
P0.7 | 87.25% | 82.75% | 86.12% |
The following are the results of TransT's attention module and ours on the M3FD test set.
Attention Module | TransT(x4) | Ours(x5) |
---|---|---|
Model parameters | 23.2M | 19.8M |
FLOPs | 25.49G | 25.49G |
mIOU | 0.80 | 0.83 |
P0.5 | 99.25% | 99.20% |
P0.7 | 96.61% | 97.75% |
(x4 and x5 denote the number of stacked attention layers.)
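The stacked attention layers fuse template and search features via scaled dot-product attention. The following is a minimal single-head sketch on plain lists, illustrating only the core operation; the actual modules are multi-head, operate on convolutional feature maps, and are stacked x4/x5 times as noted above.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Computes softmax(q . k / sqrt(d)) over the keys, then returns the
    weighted sum of the value vectors.
    """
    d = len(query)
    # Similarity scores between the query and each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted combination of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

In cross-attention between the two branches, the query would come from one branch's features and the keys/values from the other's.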
```
@inproceedings{TransT,
  title={Transformer Tracking},
  author={Chen, Xin and Yan, Bin and Zhu, Jiawen and Wang, Dong and Yang, Xiaoyun and Lu, Huchuan},
  booktitle={CVPR},
  year={2021}
}

@inproceedings{TarDAL,
  title={Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection},
  author={Liu, Jinyuan and Fan, Xin and Huang, Zhangbo and Wu, Guanyao and Liu, Risheng and Zhong, Wei and Luo, Zhongxuan},
  booktitle={CVPR},
  year={2022}
}
```