all-in-one's Introduction

All-in-One

Official code for "All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment", accepted at ACM MM 2023.

Requirements

  • python==3.8.18
  • torch==1.13.0
  • torchvision==0.14.0
  • torchaudio==0.13.0
  • timm==0.9.10

Results (AUC)

Method       LaSOT  LaSOTEXT  OTB99-L  TNL2K  WebUAV-3M  Model
All-in-One   72.8   55.8      71.0     55.9   58.5       All-in-One
Raw Results  LaSOT  LaSOTEXT  OTB99-L  TNL2K  WebUAV-3M  -

Note that the pretrained model above was trained on an Ubuntu 18.04 server with multiple NVIDIA RTX A6000 Ada GPUs. The results above are reported using analysis_results.py. For WebUAV-3M, we recommend the official evaluation toolkit. This is a work in progress; more details will be described in our journal version. Download the model weights and raw results from Baidu Pan (extraction code: alli).
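For readers unfamiliar with the AUC metric in the table, the following is a minimal sketch of how a success-plot AUC is commonly computed from per-frame boxes. It is not the code in analysis_results.py; the function names, the (x, y, w, h) box format, the 21 evenly spaced thresholds, and the strict `>` comparison are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Average success rate (in percent) over IoU thresholds in [0, 1].

    The success rate at a threshold is the fraction of frames whose IoU
    exceeds that threshold (assumed convention: strict inequality).
    """
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    success = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return 100.0 * sum(success) / len(success)


if __name__ == "__main__":
    # Hypothetical two-frame sequence: one exact match, one partial overlap.
    preds = [(0, 0, 10, 10), (5, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
    print(f"AUC: {success_auc(preds, gts):.1f}")
```

Benchmark toolkits differ in details such as how frames with missing annotations are handled, so numbers from a sketch like this should not be compared directly with the table.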

Evaluation

Download the All-in-One model (extraction code: alli) and place it in $PROJECT_ROOT$/All-in-One/output/checkpoints/train/.

python tracking/test.py --dataset webuav3m --threads 8
python tracking/analysis_results.py

Before evaluation, please make sure the data path in local.py is correct.
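A quick way to catch a wrong data path before launching a long run is to check that each configured directory exists. The sketch below is a hypothetical helper, not part of this repository: `find_missing_paths` and the example dictionary are illustrative names, and you would fill in the paths you actually set in local.py.

```python
import os


def find_missing_paths(dataset_paths):
    """Return the sorted names whose path is empty or not an existing directory."""
    missing = []
    for name, path in dataset_paths.items():
        if not path or not os.path.isdir(path):
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical entries; replace them with the values from your local.py.
    example = {"lasot": "/data/LaSOT", "tnl2k": "/data/TNL2K"}
    for name in find_missing_paths(example):
        print(f"dataset path not found: {name}")
```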

Training

Download the pre-trained MAE ViT-Base weights and put them in $PROJECT_ROOT$/All-in-One/lib/models/pretrained_models.

1. Training with one GPU.

cd /$PROJECT_ROOT$/All-in-One/lib/train
python run_training_all_in_one.py --save_dir ./output

2. Training with multiple GPUs.

cd /$PROJECT_ROOT$/All-in-One
python tracking/train.py --save_dir ./output --mode multiple --nproc_per_node 8

Before training, please make sure the data path in local.py is correct.

Thanks

This implementation is based on OSTrack. Please refer to their repository for more details.

Citation

If you find that this project helps your research, please consider citing our paper:

@inproceedings{zhang2023all,
  title={All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment},
  author={Zhang, Chunhui and Sun, Xin and Yang, Yiqian and Liu, Li and Liu, Qiong and Zhou, Xi and Wang, Yanfeng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={5552--5561},
  year={2023}
}

Contact

Feedback and comments are welcome! Feel free to contact us via [email protected].

all-in-one's Issues

The language is useless.

In your implementation, the [CLS] token of BertEmbedding is used. Without any attention operation, the [CLS] token of BertEmbedding cannot carry any information about the input description. In other words, the language modality is useless in the current implementation. Am I misunderstanding something here?
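The general property behind this question can be illustrated with a toy embedding layer. The sketch below is not the repository's code and does not use BERT weights: the table sizes, token ids, and function names are made up. It only shows that when each token's output is a per-position lookup (word + position + type) followed by a per-token layer norm, the vector at the [CLS] position is the same for any two sentences.

```python
import random

HIDDEN = 8
rng = random.Random(0)


def make_table(rows):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]


word_emb = make_table(100)  # toy vocabulary of 100 token ids
pos_emb = make_table(16)    # toy maximum sequence length of 16
type_emb = make_table(2)    # two segment types, only type 0 is used here


def layer_norm(vec, eps=1e-12):
    """Normalize one token vector; no other tokens are involved."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / (var + eps) ** 0.5 for v in vec]


def embed(token_ids):
    """Embedding layer only: per-token lookups plus per-token layer norm."""
    out = []
    for pos, tok in enumerate(token_ids):
        summed = [w + p + t for w, p, t in zip(word_emb[tok], pos_emb[pos], type_emb[0])]
        out.append(layer_norm(summed))
    return out


CLS = 1  # hypothetical id for the [CLS] token
sentence_a = [CLS, 10, 11, 12]
sentence_b = [CLS, 57, 3, 99, 42]

# Compare the vectors at position 0 for two different descriptions.
print(embed(sentence_a)[0] == embed(sentence_b)[0])
```

Mixing information across tokens requires a layer that reads more than one position, such as self-attention; whether the released model applies one to the text tokens downstream is for the authors to confirm.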

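Problem reproducing accuracy on the LaSOT dataset

Hello, and thank you very much for your work. I retrained the model following your open-source code and configuration. Judging by the AUC on TNL2K, the accuracy in the paper can be reproduced, but the accuracy on LaSOT (AUC and P) cannot; it falls far short of your released numbers. Did you use any special configuration or parameter settings during training that are not mentioned here, particularly for LaSOT? If so, could you please share them? Thank you, and I look forward to your reply.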

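How can more accurate GT be obtained?

Hello, and thank you very much for your work.

In the previous issue you mentioned the following point:
Improve the annotations of WebUAV3M_train, TNL2K_train, and OTB99L_train (including language descriptions) (inspired by VASR [ICCV 2021]; training the model with more accurate GT clearly improves performance); or replace them with class names, etc. (inspired by VLT_TT [NeurIPS2022, TPAMI submitting])

For the language branch, replacing the text with class names is easy to implement. For the vision branch, how should "training the model with more accurate GT" be implemented?
VASR can provide more accurate bounding-box annotations and shows experimentally that precise GT improves tracking accuracy, but the related tools have not been open-sourced, and the more accurate bounding-box annotations mentioned in the paper have not been released either.
How can more accurate GT be obtained?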

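Result files for the VLT algorithm

Hello!
Thank you for your outstanding contribution.
Could you share the result files of VLT, the comparison algorithm in Figure 7? Thanks in advance!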
