983632847 / all-in-one

All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

License: MIT License

Shell 0.32% Python 99.33% Jupyter Notebook 0.36%

all-in-one's Introduction

All-in-One

Official Code for All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment accepted by ACM MM 2023.

Requirements

python==3.8.18
torch==1.13.0
torchvision==0.14.0
torchaudio==0.13.0
timm==0.9.10
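
A quick way to confirm that an environment matches these pins is sketched below. It is a minimal helper, not part of the repository: the name check_pins and the rule of ignoring local build suffixes such as "+cu117" are assumptions made for illustration.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
PINNED = {
    "torch": "1.13.0",
    "torchvision": "0.14.0",
    "torchaudio": "0.13.0",
    "timm": "0.9.10",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that does not match.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Assumption: a local build tag such as "1.13.0+cu117" counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the listed version. The Python interpreter version itself is not checked here.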

Results (AUC)

Method	LaSOT	LaSOTEXT	OTB99-L	TNL2K	WebUAV-3M	Model
All-in-One	72.8	55.8	71.0	55.9	58.5	All-in-One
Raw Results	LaSOT	LaSOTEXT	OTB99-L	TNL2K	WebUAV-3M	-

The pretrained model above was trained on an Ubuntu 18.04 server with multiple NVIDIA RTX A6000 Ada GPUs. The results above are reported using analysis_results.py. For WebUAV-3M, we recommend the official evaluation toolkit. This is a work in progress; more details will be described in our journal version. Download the model weights and raw results from Baidu Pan, extraction code: alli.
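
For readers unfamiliar with the AUC metric in the table, the sketch below shows the success-plot AUC as it is commonly defined in single-object tracking benchmarks: the fraction of frames whose IoU exceeds a threshold, averaged over evenly spaced thresholds. It is an illustration under stated assumptions, not the code in analysis_results.py. The (x, y, w, h) box format, the 21 thresholds, the strict ">" comparison, and the 0 to 100 scaling are all assumptions, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1], scaled to 0-100.

    Expects at least one frame; boxes are paired by position.
    """
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Success rate at a threshold: share of frames whose overlap exceeds it.
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return 100.0 * sum(rates) / len(rates)
```

Benchmark toolkits differ in details such as how frames without a visible target are handled, so numbers from a sketch like this are not directly comparable with the table.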

Evaluation

Download the All-in-One model (extraction code: alli) and place it in $PROJECT_ROOT$/All-in-One/output/checkpoints/train/.

python tracking/test.py --dataset webuav3m --threads 8
python tracking/analysis_results.py

Before evaluation, please make sure the data path in local.py is correct.

Training

Download the pre-trained MAE ViT-Base weights and put them in $PROJECT_ROOT$/All-in-One/lib/models/pretrained_models.

1. Training with one GPU.

cd /$PROJECT_ROOT$/All-in-One/lib/train
python run_training_all_in_one.py --save_dir ./output

2. Training with multiple GPUs.

cd /$PROJECT_ROOT$/All-in-One
python tracking/train.py --save_dir ./output --mode multiple --nproc_per_node 8

Before training, please make sure the data path in local.py is correct.
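
The Evaluation and Training steps each expect downloaded weights in a specific directory. A small pre-flight check along the following lines can catch a misplaced download before a long run starts. This is a sketch only: missing_dirs is a hypothetical helper, it does not inspect the dataset paths configured in local.py, and it only tests that each directory exists and is non-empty, not that the files inside are the right ones.

```python
from pathlib import Path

# Directories named in the Evaluation and Training instructions,
# relative to $PROJECT_ROOT$.
EXPECTED_DIRS = (
    Path("All-in-One/output/checkpoints/train"),
    Path("All-in-One/lib/models/pretrained_models"),
)


def missing_dirs(project_root, expected=EXPECTED_DIRS):
    """Return the expected directories that are absent or empty."""
    root = Path(project_root)
    missing = []
    for rel in expected:
        full = root / rel
        # An existing but empty directory is treated the same as a missing one.
        if not full.is_dir() or not any(full.iterdir()):
            missing.append(rel)
    return missing
```

Calling missing_dirs with the project root returns an empty list when both directories are populated.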

Thanks

This implementation is based on OSTrack. Please refer to their repository for more details.

Citation

If you find that this project helps your research, please consider citing our paper:

@inproceedings{zhang2023all,
  title={All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment},
  author={Zhang, Chunhui and Sun, Xin and Yang, Yiqian and Liu, Li and Liu, Qiong and Zhou, Xi and Wang, Yanfeng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={5552--5561},
  year={2023}
}

Contact

Feedback and comments are welcome! Feel free to contact us via [email protected].

all-in-one's Issues

The language is useless.

In your implementation, the [CLS] token of the BertEmbedding is used. Without any attention operation, the [CLS] token of the BertEmbedding cannot carry any information about the input description. In other words, the language modality is useless in the current implementation. Am I misunderstanding something here?
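
The argument in this issue can be illustrated with a toy embedding layer. In a BERT-style embedding layer each token vector is a sum of lookups for that token and its position, followed by a per-token layer norm, so the vector at position 0 does not depend on the rest of the sentence. The sketch below uses random tables in place of learned weights and omits token-type embeddings and dropout. All sizes and token ids are made up. It only demonstrates the general property; whether the repository's model uses the embedding output this way is the issue author's claim and is not verified here.

```python
import math
import random

random.seed(0)
DIM, VOCAB, MAX_LEN = 4, 10, 8
CLS = 0  # hypothetical id for the [CLS] token

# Randomly initialised lookup tables standing in for learned embeddings.
word_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
pos_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_LEN)]


def layer_norm(vec, eps=1e-12):
    """Normalise one token vector; no other token is involved."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def embed(token_ids):
    """Word + position embedding followed by per-token layer norm."""
    return [
        layer_norm([w + p for w, p in zip(word_emb[t], pos_emb[i])])
        for i, t in enumerate(token_ids)
    ]


# Two different "descriptions" that both start with [CLS].
sentence_a = [CLS, 3, 5, 7]
sentence_b = [CLS, 2, 9, 4, 1]
same_cls = embed(sentence_a)[0] == embed(sentence_b)[0]
print(same_cls)
```

The [CLS] vectors of the two sentences are identical here because nothing in embed mixes information across positions. Only a later operation that attends over the other tokens, such as a self-attention layer, could make the [CLS] vector depend on the description.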

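Problem reproducing accuracy on the LaSOT dataset

Hello, and thank you very much for your work. I retrained the model using your open-source code and configuration. Judging by AUC on the TNL2K dataset, I can reproduce the accuracy in the paper, but on the LaSOT dataset the accuracy (AUC and P) cannot be reproduced and falls far short of your released numbers. Did you use any special configuration or parameter settings during training that are not mentioned here, particularly for LaSOT? If so, could you please share them? Thank you, and I look forward to your reply.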

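How can more accurate GT be obtained?

Hello, and thank you very much for your work.

In the previous issue you mentioned the following:
Improve the annotations of WebUAV3M_train, TNL2K_train, and OTB99L_train (including language descriptions) (inspired by VASR [ICCV 2021]; training with more accurate GT clearly improves results); or replace them with class names and the like (inspired by VLT_TT [NeurIPS2022, TPAMI submitting]).

For the language branch, replacing the text with class names and the like is easy to implement. For the visual branch, how should "training the model with more accurate GT" be implemented?
VASR can provide more accurate bounding-box annotations and showed experimentally that precise GT improves tracking accuracy, but the related tools have not been open-sourced, and the more accurate bounding-box annotations mentioned in the paper have not been released either.
How can more accurate GT be obtained?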

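Result files for the VLT algorithm

Hello!
Thank you for your outstanding contribution.
Could you share the result files for VLT, the compared algorithm in Figure 7? Thanks in advance!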

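Hello, what server configuration did you use for training? How long does one complete training run take?

Does your code require the BERT pre-trained model to be prepared in advance, or can training be run directly?

When will the code be open-sourced, please?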
