This is the course project for Pattern Recognition and Machine Learning at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology.

TransT presents an attention-based network that fuses template and search features to achieve precise and robust tracking. Inspired by TransT, we propose a pseudo-Siamese network that is independent at the lower levels and shared at the higher levels, matching the characteristics of heterogeneous image-matching tasks. In our experiments, we compare different backbones as well as different feature-extraction strategies for the template and search images. We also simplify the attention module in TransT according to the characteristics of image matching.
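The low-sep-high-sharing idea can be sketched structurally as follows. This is an illustrative toy, not the actual model: the real backbone uses convolutional stages, and the scalar "weights" below are hypothetical stand-ins. Each modality (e.g., the visible template and the infrared search image) gets its own low-level stage, while a single high-level stage with shared parameters processes both branches.

```python
# Toy sketch of the low-sep-high-sharing pseudo-Siamese structure.
# The scalar weights stand in for conv layers; only the parameter-sharing
# pattern is meant to be accurate, not the computation itself.

class ToyBackbone:
    def __init__(self):
        # Independent low-level parameters, one set per modality.
        self.low_visible = 2.0
        self.low_infrared = 3.0
        # A single shared high-level parameter used by both branches.
        self.high_shared = 0.5

    def low_stage(self, x, modality):
        # Modality-specific low-level feature extraction.
        w = self.low_visible if modality == "visible" else self.low_infrared
        return [w * v for v in x]

    def high_stage(self, feat):
        # Shared high-level feature extraction (same weights for both inputs).
        return [self.high_shared * v for v in feat]

    def forward_pair(self, template, search):
        # Separate low-level branches, then one shared high-level stage.
        t = self.low_stage(template, "visible")
        s = self.low_stage(search, "infrared")
        return self.high_stage(t), self.high_stage(s)
```

A fully independent double backbone would instead duplicate the high-level stage per modality (more parameters, as in the tables below), while a single backbone would share the low-level stage as well.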
To train a model, run

```shell
python run_train.py
```

To compute the test metrics, run

```shell
python run_test.py
```

To see a demo, run

```shell
python demo.py
```
### Dataset: M3FD

M3FD is a dataset of paired visible and infrared images containing 6 kinds of targets: {People, Car, Bus, Motorcycle, Lamp, Truck}.
The following are the results of applying different backbone processing strategies to TransT on the M3FD test set.
Backbone processing strategy | single backbone | low-sep-high-sharing double backbones | independent double backbones |
---|---|---|---|
Model parameters | 23.0M | 23.2M | 31.6M |
FLOPs | 25.49G | 25.49G | 25.49G |
mIOU | 0.71 | 0.80 | 0.80 |
P0.5 | 86.26% | 95.10% | 92.91% |
P0.7 | 70.35% | 88.79% | 86.95% |
(P0.5 and P0.7 denote the fraction of test samples with IoU above 0.5 and 0.7, respectively.)
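The table's metrics can be computed from the per-sample IoU values. The helper names below are hypothetical (not from this repository); they simply make the definitions of mIOU and P@threshold explicit.

```python
# mIOU: mean IoU over all test samples.
# P@t: fraction of samples whose IoU exceeds threshold t (0.5 and 0.7 above).

def miou(ious):
    """Mean IoU over a list of per-sample IoU values."""
    return sum(ious) / len(ious)

def p_at(ious, t):
    """Fraction of samples with IoU above threshold t."""
    return sum(1 for iou in ious if iou > t) / len(ious)
```

For example, for per-sample IoUs `[0.9, 0.6, 0.4, 0.8]`, `miou` gives 0.675 and `p_at(..., 0.5)` gives 0.75.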
### Different Backbones

The following are the results of applying the low-sep-high-sharing backbone processing strategy to models whose backbones differ from TransT's, on the M3FD test set.
Backbone Network | ResNet50 | MobileNetv3 | CSPNet |
---|---|---|---|
Model parameters | 23.2M | 17.4M | 26.7M |
FLOPs | 25.49G | 15.34G | 29.42G |
mIOU | 0.80 | 0.81 | 0.77 |
P0.5 | 95.10% | 96.66% | 91.78% |
P0.7 | 88.79% | 88.63% | 84.09% |
Applying multi-scale feature-map fusion to the above backbones yields the following results.
Backbone Network | ResNet50(multi) | MobileNetv3(multi) | CSPNet(multi) |
---|---|---|---|
Model parameters | 24.3M | 17.9M | 27.6M |
FLOPs | 26.16G | 15.64G | 29.98G |
mIOU | 0.80 | 0.81 | 0.81 |
P0.5 | 94.75% | 96.33% | 93.91% |
P0.7 | 86.05% | 88.43% | 88.84% |
We also conduct experiments on homologous image matching using the visible-image part of M3FD. Note that the model is then a standard Siamese network.
Backbone Network | ResNet50 | MobileNetv3(multi) | CSPNet(multi) |
---|---|---|---|
Model parameters | 23.0M | 17.7M | 27.0M |
FLOPs | 25.49G | 15.64G | 29.98G |
mIOU | 0.92 | 0.89 | 0.91 |
P0.5 | 99.25% | 99.20% | 99.61% |
P0.7 | 96.61% | 97.75% | 95.86% |
We apply the above models to the COCO dataset without fine-tuning.
Backbone Network | ResNet50 | MobileNetv3(multi) | CSPNet(multi) |
---|---|---|---|
mIOU | 0.85 | 0.81 | 0.84 |
P0.5 | 95.36% | 94.33% | 94.81% |
P0.7 | 87.25% | 82.75% | 86.12% |
The following are the results of TransT's attention module and ours on the M3FD test set.
Attention Module | TransT(x4) | Ours(x5) |
---|---|---|
Model parameters | 23.2M | 19.8M |
FLOPs | 25.49G | 25.49G |
mIOU | 0.80 | 0.83 |
P0.5 | 99.25% | 99.20% |
P0.7 | 96.61% | 97.75% |
(x4 and x5 denote the number of stacked attention layers.)
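The stacked attention layers fuse template and search features via scaled dot-product attention. The following is a minimal single-head sketch on plain lists, illustrating only the core operation; the actual modules are multi-head, operate on convolutional feature maps, and are stacked x4/x5 times as noted above.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Computes softmax(q . k / sqrt(d)) over the keys, then returns the
    weighted sum of the value vectors.
    """
    d = len(query)
    # Similarity scores between the query and each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted combination of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

In cross-attention between the two branches, the query would come from one branch's features and the keys/values from the other's.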
```
@inproceedings{TransT,
  title={Transformer Tracking},
  author={Chen, Xin and Yan, Bin and Zhu, Jiawen and Wang, Dong and Yang, Xiaoyun and Lu, Huchuan},
  booktitle={CVPR},
  year={2021}
}

@inproceedings{TarDAL,
  title={Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection},
  author={Liu, Jinyuan and Fan, Xin and Huang, Zhangbo and Wu, Guanyao and Liu, Risheng and Zhong, Wei and Luo, Zhongxuan},
  booktitle={CVPR},
  year={2022}
}
```