
License: MIT License

image-localization loop-closure-detection visual-geolocalization visual-place-recognition adapter relocalization visual-slam


CricaVPR

This is the official repository for the CVPR 2024 paper "CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition".

Getting Started

This repo follows the framework of GSV-Cities for training, and the Visual Geo-localization Benchmark for evaluation. You can download the GSV-Cities datasets HERE, and refer to VPR-datasets-downloader to prepare test datasets.

The test dataset should be organized in a directory tree as such:

├── datasets_vg
    └── datasets
        └── pitts30k
            └── images
                ├── train
                │   ├── database
                │   └── queries
                ├── val
                │   ├── database
                │   └── queries
                └── test
                    ├── database
                    └── queries
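The directory tree above can be created with a short script (a sketch for the empty layout only; the actual dataset images still need to be placed in the leaf folders by VPR-datasets-downloader):

```python
import os

def make_vpr_tree(root):
    """Create the empty datasets_vg directory tree expected for pitts30k."""
    for split in ("train", "val", "test"):
        for part in ("database", "queries"):
            os.makedirs(
                os.path.join(root, "datasets", "pitts30k", "images", split, part),
                exist_ok=True,
            )

make_vpr_tree("datasets_vg")
```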

Before training, you should download the pre-trained foundation model DINOv2 (ViT-B/14) HERE.

Train

python3 train.py --eval_datasets_folder=/path/to/your/datasets_vg/datasets --eval_dataset_name=pitts30k --foundation_model_path=/path/to/pre-trained/dinov2_vitb14_pretrain.pth --epochs_num=10

Test

To evaluate the trained model:

python3 eval.py --eval_datasets_folder=/path/to/your/datasets_vg/datasets --eval_dataset_name=pitts30k --resume=/path/to/trained/model/CricaVPR.pth

To add PCA:

python3 eval.py --eval_datasets_folder=/path/to/your/datasets_vg/datasets --eval_dataset_name=pitts30k --resume=/path/to/trained/model/CricaVPR.pth --pca_dim=4096 --pca_dataset_folder=pitts30k/images/train
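The repo's own PCA code is not shown here, but the idea behind `--pca_dim` can be illustrated with a minimal NumPy sketch (an assumption-laden toy: a projection is fit on database descriptors and then applied to all descriptors; the real implementation may differ in details such as whitening):

```python
import numpy as np

def fit_pca(descriptors, dim):
    """Fit a PCA projection on descriptors (one row per descriptor)."""
    mean = descriptors.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(descriptors - mean, full_matrices=False)
    return mean, vt[:dim]

def apply_pca(descriptors, mean, components):
    """Project descriptors onto the top principal components."""
    return (descriptors - mean) @ components.T

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 64))  # toy 64-D descriptors
mean, comps = fit_pca(feats, 16)
reduced = apply_pca(feats, mean, comps)
print(reduced.shape)  # (100, 16)
```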

Trained Model

You can directly download the trained model HERE.

Related Work

Another work of ours, SelaVPR (two-stage VPR based on DINOv2), achieved SOTA performance on several datasets. The code is released HERE.

Acknowledgements

Parts of this repo are inspired by the following repositories:

GSV-Cities

Visual Geo-localization Benchmark

DINOv2

Citation

If you find this repo useful for your research, please consider leaving a star ⭐️ and citing the paper:

@inproceedings{lu2024cricavpr,
  title={CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition},
  author={Lu, Feng and Lan, Xiangyuan and Zhang, Lijun and Jiang, Dongmei and Wang, Yaowei and Yuan, Chun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month={June},
  year={2024}
}


cricavpr's Issues

Releasing the model on torch.hub?

Hi, thank you for uploading the code and trained models!
Could you release the model on torch.hub? It is quite simple to do, and it lets people use your model with two lines of code, which would help more people use your model and spread your work!
For example, I did it for CosPlace, and the trained model can be downloaded automatically from anywhere, without cloning the repo, like this:

import torch
model = torch.hub.load("gmberton/cosplace", "get_trained_model", backbone="ResNet50", fc_output_dim=2048)

so it would be easier for everyone to use, and I could also add it to this benchmarking repo.

PS: I really like the focus on small dimensional features, congrats on the paper!

questions about test dataset

Hello, I used the Pitts30k testing program from NetVLAD with the weight file you provided, and reduced the dimensionality to 4096 using PCA. The results I obtained seem inconsistent with those in the paper. Is there a problem with my dataset?
R@1: 88.0, R@5: 94.7, R@10: 96.6, R@20: 97.4

question about test

Hi, @Lu-Feng
Thank you for this interesting research!

I have some questions about the cross-image encoder.

From the question linked below:
#7 (comment)

I think the cross-image encoder could exploit prior knowledge (the consecutiveness of queries) of the test dataset.

The supplementary material (Table 10) includes an ablation study of the cross-image encoder. The performance on Pitts30k is:
No encoder: 90.6, 95.9, 97.2
Transformer encoder layer x 2 (default): 94.8, 97.4, 98.1

Comparing this with CricaVPR's performance in the link above, when the test dataloader uses shuffle=True:

2024-05-02 12:47:59   [0]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 92.2, R@5: 96.2, R@10: 97.2, R@100: 99.4
2024-05-02 12:49:38   [1]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 91.8, R@5: 96.1, R@10: 97.3, R@100: 99.4
2024-05-02 12:51:17   [2]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 92.0, R@5: 96.1, R@10: 97.2, R@100: 99.4
2024-05-02 12:52:57   [3]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 91.9, R@5: 96.1, R@10: 97.3, R@100: 99.4
2024-05-02 12:54:37   [4]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 92.0, R@5: 96.0, R@10: 97.3, R@100: 99.4
2024-05-02 12:56:28   [5]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 92.1, R@5: 96.0, R@10: 97.3, R@100: 99.3
2024-05-02 12:58:21   [6]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 92.1, R@5: 96.0, R@10: 97.3, R@100: 99.4
2024-05-02 13:00:23   [7]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 92.1, R@5: 96.2, R@10: 97.2, R@100: 99.4
2024-05-02 13:02:23   [8]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 91.9, R@5: 95.9, R@10: 97.2, R@100: 99.4
2024-05-02 13:04:21   [9]: Recalls on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: R@1: 92.0, R@5: 95.9, R@10: 97.2, R@100: 99.3
2024-05-02 13:04:21   Average recall on < BaseDataset, pitts30k - #database: 10000; #queries: 6816 >: 91.99090375586854

It seems the cross-image encoder has much less impact when the test dataloader uses shuffle=True, and I think this issue should be addressed.

Release Code

Hello, when will you be releasing the code?

How do you generate the feature maps in Fig. 6?

Thanks for sharing your work! I wonder how you generated the visualization of the feature maps in Fig. 6 of the paper. Could you please give a detailed explanation (which layer you used, how you combined the features, and so on), or release the relevant code?

Why PCA?

Hi, thank you for the code!

Is there a specific reason the dimensionality reduction is done with PCA? I understand the flexibility of storing the PCA matrix and controlling the dimensionality after fitting, but a learnable linear projection at the end of the network might deliver better performance. Did you test that?

Thanks!

Question about infer_batch_size

Thank you for sharing your remarkable work; I am truly impressed by its performance. However, I have a question about infer_batch_size.

The paper mentions, "An inference batch contains 8 images for Pitts30K and 16 images for others". Given that CricaVPRNet is designed to encode along the batch dimension, I am curious whether infer_batch_size or the image order within the dataset might affect the generated descriptors. If so, could different test results be observed with varying infer_batch_size, or if the test set is shuffled?

Please let me know if I have misunderstood or overlooked any details.
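The concern in the two issues above can be illustrated with a toy cross-batch attention step (a minimal random-feature sketch, not the actual CricaVPR encoder): when each descriptor attends to every other image in its batch, changing the batch composition changes the output for the same image.

```python
import numpy as np

def cross_batch_attention(feats):
    """One self-attention step across the batch dimension:
    each descriptor attends to all descriptors in the same batch."""
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    # Row-wise softmax over attention scores.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats

rng = np.random.default_rng(0)
imgs = rng.standard_normal((4, 8))  # 4 images, 8-D features

out_a = cross_batch_attention(imgs[:2])      # image 0 batched with image 1
out_b = cross_batch_attention(imgs[[0, 2]])  # image 0 batched with image 2
# The descriptor for image 0 differs between the two batches.
print(np.allclose(out_a[0], out_b[0]))  # False
```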

Questions for DINOv2

Hello, how does your code reduce the feature dimension to 512 or 1024 after loading DINOv2? GeM pooling, or an added linear layer?
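For reference, GeM pooling (one of the options asked about) aggregates patch features into a single descriptor without changing the feature dimension, so a reduction to 512/1024 would additionally need something like a linear projection; which combination CricaVPR uses is for the authors to confirm. A NumPy sketch of GeM:

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over patch positions.
    features: (num_patches, dim) patch features; returns a (dim,) descriptor.
    p = 1 is average pooling; p -> infinity approaches max pooling."""
    clamped = np.clip(features, eps, None)
    return (clamped ** p).mean(axis=0) ** (1.0 / p)

rng = np.random.default_rng(0)
patches = np.abs(rng.standard_normal((256, 768)))  # toy DINOv2-like patch features
desc = gem_pool(patches)
print(desc.shape)  # (768,)
```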

Could you please give me some help about the NordLand dataset?

I found the model performs very poorly on the NordLand dataset while performing very well on other datasets. I downloaded the dataset from here, because the download URL for this dataset in https://github.com/gmberton/VPR-datasets-downloader has become unavailable, and then used the left part of download_nordland.py to format it. However, the eval results are very unsatisfying. Could you please give some help with the Nordland dataset?
