lu-feng / selavpr

Official repository for the ICLR 2024 paper "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition".

License: MIT License

Python 100.00%
image-localization visual-geolocalization visual-place-recognition loop-closure-detection adapter relocalization visual-slam

selavpr's Introduction

SelaVPR

This is the official repository for the ICLR 2024 paper "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition".

Summary

This paper presents a novel method, named SelaVPR, for seamless adaptation of pre-trained foundation models to the (two-stage) VPR task. By adding a few tunable lightweight adapters to the frozen pre-trained model, we achieve an efficient hybrid global-local adaptation that yields both global features for retrieving candidate places and dense local features for re-ranking. The SelaVPR feature representation focuses on discriminative landmarks, thus closing the gap between the pre-training and VPR tasks and fully unleashing the capability of pre-trained models for VPR. SelaVPR can directly match the local features without spatial verification, making re-ranking much faster.
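As a rough illustration of the last point, re-ranking by direct local feature matching can be as simple as counting mutual nearest neighbours between the dense local features of the query and a candidate. A minimal sketch, assuming L2-normalized local features (the exact scoring used in this repo may differ):

import torch

def mutual_nn_score(query_feats, cand_feats):
    # query_feats: (Nq, D), cand_feats: (Nc, D), both L2-normalized local features
    sim = query_feats @ cand_feats.t()      # (Nq, Nc) cosine similarity matrix
    nn_q = sim.argmax(dim=1)                # best candidate patch for each query patch
    nn_c = sim.argmax(dim=0)                # best query patch for each candidate patch
    ids = torch.arange(query_feats.shape[0])
    mutual = nn_c[nn_q] == ids              # keep only mutual nearest neighbours
    return int(mutual.sum())                # number of mutual matches = re-ranking score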

The global adaptation is achieved by adding adapters after the multi-head attention layer and in parallel to the MLP layer in each transformer block (see adapter1 and adapter2 in /backbone/dinov2/block.py).
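A bottleneck adapter of this kind looks roughly like the sketch below (layer sizes, placement, and any scaling factors are illustrative assumptions; the actual modules are adapter1 and adapter2 in /backbone/dinov2/block.py):

import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # down-projection to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # up-projection back to the ViT width

    def forward(self, x):
        # residual connection keeps the frozen pre-trained path intact
        return x + self.up(self.act(self.down(x)))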

The local adaptation is implemented by adding up-convolutional layers after the entire ViT backbone to upsample the feature map and get dense local features (see LocalAdapt in network.py).
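Conceptually, the local adaptation resembles the following sketch (channel dimensions and layer choices are illustrative assumptions; the actual module is LocalAdapt in network.py):

import torch.nn as nn

class LocalAdaptSketch(nn.Module):
    def __init__(self, in_dim=1024, out_dim=128):
        super().__init__()
        # two transposed convolutions upsample the patch-token grid by 4x in each dimension
        self.upconv1 = nn.ConvTranspose2d(in_dim, 512, kernel_size=3, stride=2,
                                          padding=1, output_padding=1)
        self.upconv2 = nn.ConvTranspose2d(512, out_dim, kernel_size=3, stride=2,
                                          padding=1, output_padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, patch_tokens, h, w):
        # patch_tokens: (B, h*w, C) from the ViT backbone -> (B, C, h, w) feature map
        x = patch_tokens.transpose(1, 2).reshape(-1, patch_tokens.shape[-1], h, w)
        x = self.act(self.upconv1(x))
        return self.upconv2(x)   # dense local features at 4x the patch-grid resolution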

Getting Started

This repo follows the Visual Geo-localization Benchmark. You can refer to it (VPR-datasets-downloader) to prepare datasets.

The datasets should be organized in a directory tree as follows:

├── datasets_vg
    └── datasets
        └── pitts30k
            └── images
                ├── train
                │   ├── database
                │   └── queries
                ├── val
                │   ├── database
                │   └── queries
                └── test
                    ├── database
                    └── queries

Before training, you should download the pre-trained foundation model DINOv2(ViT-L/14) HERE.

Train

Finetuning on MSLS

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --queries_per_epoch=30000 --foundation_model_path=/path/to/pre-trained/dinov2_vitl14_pretrain.pth

Further finetuning on Pitts30k

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --queries_per_epoch=5000 --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth

Trained Models

The model finetuned on MSLS (for diverse scenes).

DOWNLOAD    MSLS-val                 Nordland-test            St. Lucia
            R@1    R@5    R@10      R@1    R@5    R@10      R@1    R@5    R@10
LINK        90.8   96.4   97.2      85.2   95.5   98.5      99.8   100.0  100.0

The model further finetuned on Pitts30k (only for urban scenes).

DOWNLOAD    Tokyo24/7                Pitts30k                 Pitts250k
            R@1    R@5    R@10      R@1    R@5    R@10      R@1    R@5    R@10
LINK        94.0   96.8   97.5      92.8   96.8   97.7      95.7   98.8   99.2

Test

Set rerank_num=100 to reproduce the results in the paper, or set rerank_num=20 to achieve nearly the same results with only 1/5 of the re-ranking runtime (0.018s per query).

python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --resume=/path/to/finetuned/pitts30k/model/SelaVPR_pitts30k.pth --rerank_num=100

Local Matching using DINOv2+Registers

By adding registers, DINOv2 can achieve better local matching performance. A pre-trained DINOv2+registers model can be downloaded HERE.

To use the SelaVPR model based on the DINOv2+registers backbone, simply add --registers to the (train or test) command and load a model with registers, for example:

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --queries_per_epoch=30000 --foundation_model_path=/path/to/pre-trained/dinov2_vitl14_reg4_pretrain.pth --registers
python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --resume=/path/to/finetuned/msls/model/SelaVPR_reg4_msls.pth --rerank_num=100 --registers

The finetuned (on MSLS) SelaVPR model with registers can be downloaded HERE.

For the (dense or coarse) local matching between two images, run

python3 visualize_pairs.py --datasets_folder=./ --resume=/path/to/finetuned/msls/model/SelaVPR_reg4_msls.pth --registers

Efficient RAM Usage (optional)

The test_efficient_ram_usage() function in test.py addresses running out of RAM (which may otherwise cause the program to be killed). It saves the extracted local features to ./output_local_features/ and loads only the local features currently needed into RAM at any time. To use it, simply add --efficient_ram_testing to the (train or test) command, for example:

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --queries_per_epoch=5000 --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth --efficient_ram_testing
python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --resume=/path/to/finetuned/pitts30k/model/SelaVPR_pitts30k.pth --rerank_num=100 --efficient_ram_testing
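The underlying pattern is roughly the following sketch (file naming and array handling are illustrative assumptions; the real logic is in test_efficient_ram_usage() in test.py):

import os
import numpy as np

FEATURE_DIR = "./output_local_features"

def save_local_features(image_index, features):
    # features: numpy array of dense local features for one database/query image
    os.makedirs(FEATURE_DIR, exist_ok=True)
    np.save(os.path.join(FEATURE_DIR, f"{image_index}.npy"), features)

def load_local_features(image_index):
    # loaded on demand, so RAM only holds the features needed for the current re-ranking step
    return np.load(os.path.join(FEATURE_DIR, f"{image_index}.npy"))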

More Details about Datasets

MSLS-val: We use the official version of MSLS-val (containing only 740 query images) for testing, which is a subset of the MSLS-val formatted by VPR-datasets-downloader (containing about 11k query images). More details can be found here.

Nordland-test: Download the Downsampled version here.

Related Work

Our other work, CricaVPR (one-stage VPR based on DINOv2), presents a multi-scale convolution-enhanced adaptation method and achieves SOTA performance on several datasets. The code is released HERE.

Acknowledgements

Parts of this repo are inspired by the following repositories:

Visual Geo-localization Benchmark

DINOv2

Citation

If you find this repo useful for your research, please consider leaving a star⭐️ and citing the paper

@inproceedings{selavpr,
  title={Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition},
  author={Lu, Feng and Zhang, Lijun and Lan, Xiangyuan and Dong, Shuting and Wang, Yaowei and Yuan, Chun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}

selavpr's People

Contributors

lu-feng

selavpr's Issues

Nordland

At what frame interval was your Nordland dataset extracted? Sampling every 10 frames, I got 87.2, 93.9, 95.6.

Question about different results when trying to reproduce

Thank you for sharing your work, your paper was very interesting and the results are also very impressive!

I had a question regarding the evaluation on MSLS-val. I attempted to reproduce your results by following the repository, downloading the data, and training the model as described in the README. Initially, I trained the model solely on the MSLS dataset. I attempted to evaluate the results on MSLS by executing the following command for both my trained model and the provided trained model:

python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth --rerank_num=100

However, these were the results that I obtained:

Model                            R@1    R@5    R@10
Claimed performance in README    90.8   96.4   97.2
Self-trained model               87.0   94.0   95.6
Downloaded model                 86.6   93.8   95.6

Further fine-tuning the model on Pitts30k and evaluating it gave the same results as you had in your README for evaluation on Pitts30k. Therefore, I'm wondering if you could help me understand why there's a difference for the MSLS-val. Am I evaluating with the wrong data, or is there something else I might be missing?

speed and performance about SelaVPR

Thanks to the authors for this work, congratulations!

I haven't run the code yet, but I have a few questions:

  1. The README says that with rerank_num set to 20 the runtime is 0.018s. What image resolution and machine configuration were used? And how does the full SelaVPR method (with re-ranking) compare to MixVPR in speed?
  2. When applying the method to different datasets, does it always need fine-tuning, or can the pre-trained model be used directly (e.g. when the data are all outdoor street scenes)?
  3. How does the full SelaVPR method (with re-ranking) perform on indoor datasets?

Looking forward to your answers, thanks.

how to view the keypoint matching in picture

Thanks for sharing your work, your paper was very interesting and the results are also very impressive!
I have a question about how to view the keypoint matching in a picture. The variable kps is defined but not used; could you please tell me how to get the matching keypoints? Thank you very much!

The RAM

Thank you for sharing your work, your paper was very interesting and the results are also very impressive! When I trained the pre-trained model on the MSLS dataset, I ran into the following problem: because RAM ran out of memory, my program was killed by the system. How can I solve this issue?

Questions about image input size and model initialization

Hello! Thank you for such an impressive work.
I have some questions about the model input in this work and hope to get your help.

In network.py, when building the model backbone, the image input size is set to 518, yet the subsequent tests all rescale images to (224, 224). Can the maximum input image size be (518, 518)?

if args.registers:
    backbone = vit_large(patch_size=14, img_size=518, init_values=1, block_chunks=0, num_register_tokens=4)
else:
    backbone = vit_large(patch_size=14, img_size=518, init_values=1, block_chunks=0)

For images with very different width and height, such as (1200, 680), resizing them to a square (224, 224) loses information along the longer dimension. Can I instead scale them down proportionally, keeping both sides divisible by patch_size, and feed them to the model? That is, convert (1200, 680) to (588, 336) (divisible by 14).
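For reference, a minimal sketch of the proportional resize described above, assuming PIL and rounding each side to a multiple of the patch size (whether the model accepts such inputs depends on its positional-embedding interpolation):

from PIL import Image

def resize_divisible(img, target_long_side=588, patch=14):
    # proportionally resize so that both sides are multiples of the patch size
    w, h = img.size
    scale = target_long_side / max(w, h)
    new_w = max(patch, round(w * scale / patch) * patch)   # e.g. 1200 -> 588
    new_h = max(patch, round(h * scale / patch) * patch)   # e.g. 680  -> 336
    return img.resize((new_w, new_h), Image.BILINEAR)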

Since I have just started to learn about ViT models, many basic questions are not clear to me. I hope to get your reply. Thank you!
