
syliz517 / clip-reid

Official implementation for "CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels" (AAAI 2023)

License: MIT License

Python 100.00%
clip reid

clip-reid's People

Contributors

syliz517


clip-reid's Issues

A question about the prompt-training code

Hello author, sorry to bother you. The first line of PromptLearner's forward function is cls_ctx = self.cls_ctx[label], and I don't quite understand it. Taking Market-1501 as an example, during training self.cls_ctx is a (751, 4, 512) tensor, and with a batch size of 64 this line picks out the self.cls_ctx entries for the corresponding labels. Since self.cls_ctx keeps being updated during stage-1 training, it ends up corresponding to those 751 identities, i.e. the final self.cls_ctx is effectively a (4, 512) prompt vector for each individual. But at inference time there are 750 new identities, so how does this generalize?

Sorry, my question might be a bit silly. I read the CoCoOp paper and source code but still didn't understand, and the task is also different. I hope you can give me some guidance, best wishes.
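For reference, here is a minimal sketch of what that indexing does, with the shapes described above (the names are illustrative, not the repository's exact PromptLearner code):

  import torch
  import torch.nn as nn

  # Illustrative sketch only: one learnable prompt context per training identity.
  num_ids, n_ctx, dim = 751, 4, 512
  cls_ctx = nn.Parameter(torch.empty(num_ids, n_ctx, dim))
  nn.init.normal_(cls_ctx, std=0.02)

  label = torch.randint(0, num_ids, (64,))  # identity labels for a batch of 64 images
  batch_ctx = cls_ctx[label]                # (64, 4, 512): each image gets its identity's prompt tokens

Per the paper's two-stage design, these contexts stay tied to the 751 training identities: the stage-1 text features are only used to supervise the image encoder in stage 2, and at inference only the image encoder is used, so nothing identity-specific needs to transfer to the 750 test identities.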

People tracking in videos?

Is it possible to use CLIP-ReID with YOLOv8n for people tracking in videos, or does it only work with datasets? I looked through your code and didn't find anything for working with videos.

Request for attention visualization

Hello, thank you for the excellent results and for sharing your research.

Regarding the visualization of CLIP-ReID mentioned in the Ablation Studies and Analysis section of the paper, which uses the method of Chefer, H.; Gur, S.; and Wolf, L. 2021, "Transformer Interpretability Beyond Attention Visualization", Proceedings of the IEEE/CVF CVPR, pages 782–791:

I would like to visualize my training results in the same way as Figure 3 of your paper. Could I please get access to the code you used for visualization?

About the reid model of the vehicle

First of all, thank you very much for your contribution to the field of re-identification!
I ran into some problems when using your model. When I load the vehicle model, tensor size mismatch errors are reported. What needs to be done to the ViT backbone when training on vehicle data? What modifications were made?

Custom Dataset

Hi! Thank you for your work. Do you have any guidance on how to train the model on a custom dataset? Thanks.
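A minimal sketch of the kind of dataset class TransReID-style re-ID codebases typically expect (the directory layout, file-name pattern, and tuple format here are assumptions; check the classes under datasets/ in this repository before adapting it):

  import glob
  import os.path as osp

  # Illustrative sketch only: many TransReID-style codebases represent each split
  # as a list of (img_path, pid, camid) tuples. Verify the exact tuple layout
  # this repository's datasets/ package expects before copying this.
  class MyCustomDataset:
      def __init__(self, root='../data/my_dataset'):
          self.train = self._scan(osp.join(root, 'bounding_box_train'))
          self.query = self._scan(osp.join(root, 'query'))
          self.gallery = self._scan(osp.join(root, 'bounding_box_test'))

      def _scan(self, split_dir):
          # Assumes Market-1501-style names such as 0001_c1s1_000151_00.jpg
          # (identity first, then camera).
          data = []
          for img_path in sorted(glob.glob(osp.join(split_dir, '*.jpg'))):
              name = osp.basename(img_path)
              pid = int(name.split('_')[0])
              camid = int(name.split('_')[1][1]) - 1  # 'c1s1' -> camera index 0
              data.append((img_path, pid, camid))
          return data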

Failure to replicate results?

I used the VeRi dataset to train and evaluate. The results are as follows:

2023-10-24 21:53:58,401 transreid.test INFO: Validation Results
2023-10-24 21:53:58,402 transreid.test INFO: mAP: 75.5%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-1 :92.0%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-5 :94.4%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-10 :95.9%

Rank-1 is 4.8% lower than in the paper... I double-checked the configs and ensured that the experiment settings were identical.

For reference, I'm attaching the training logs of the models:

train_log_cnn_prom_veri.txt

Missing Comparisons

Thanks for releasing this code for CLIP-based Re-ID. It is a good attempt at improving Re-ID. However, I have some key concerns:

  1. Overclaims
    In fact, your work is not the first to adopt CLIP for Re-ID. Please check the following paper from MMSports ’22, October 10, 2022:
    Konrad Habel et al., CLIP-ReIdent: Contrastive Training for Player Re-Identification
    Besides, I think the related-work section is incomplete. There are many other Transformer-based methods that should be discussed. For example, HAT (HAT: Hierarchical Aggregation Transformers for Person Re-identification) already uses multi-level supervision (similarly highlighted in the last sentence of your training details), and LAFormer uses a local-aware transformer for re-identification.
    It would be better for the authors to revise these parts.

2. Missing key comparisons
I appreciate that the authors provide ablations. However, what is the effect of using multi-level supervision ("Note that we also employ Ltri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50.")? In fact, this kind of supervision generally gives better results than supervising only the last layer, which may lead to unfair comparisons.

Since training requires feeding all images, what are the training times and test speeds on your devices (these are also not listed)?

Issue in evaluating the models

Hey, thanks for this excellent work of yours.
I have trained a model on a custom dataset; when I try to load it for evaluation, the script raises the following error:

  Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])
  Position embedding resize to height:16 width: 8
  Traceback (most recent call last):
  File "test_clipreid.py", line 44, in <module>
  model.load_param_finetune(cfg.TEST.WEIGHT)
  File "/app/model/make_model_clipreid.py", line 173, in load_param_finetune
  self.state_dict()[i].copy_(param_dict[i])
  RuntimeError: The size of tensor a (129) must match the size of tensor b (211) at non-singleton dimension 0

The message "Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])" appears during evaluation, whereas during training it reads "Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([211, 768])". Could you please check this part?

The same happens when I try to load the VeRi fine-tuned model and the Market1501 model with the scripts you have provided.
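For what it's worth, a hedged guess at the cause (the stride value below is an assumption based on TransReID-style ViT configs, not verified against this setup): for a ViT-B/16 backbone the position-embedding length is the number of patches plus one class token, and with overlapping patches the patch grid depends on the stride, so a mismatch between the training and evaluation INPUT size or MODEL.STRIDE_SIZE will prevent the checkpoint from loading.

  # Illustrative arithmetic only; assumes 16x16 patches and a 256x128 input.
  def pos_embed_len(h, w, patch=16, stride=16):
      return ((h - patch) // stride + 1) * ((w - patch) // stride + 1) + 1

  print(pos_embed_len(256, 128, stride=16))  # 129 -> the size reported at evaluation time
  print(pos_embed_len(256, 128, stride=12))  # 211 -> the size stored in the training checkpoint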

Error when training

Hi!
I get the following error:
[screenshot of the error, 2023-10-04]
As far as I understand, this error is not related to the custom dataset but to the model architecture itself.

Fine-tune on new small dataset

Hi, I have a new small person re-ID dataset (~100 IDs), and I want to fine-tune your models.
Should I fine-tune both stages, or just fine-tune for a few epochs in stage 2?
Have you tried merging all the re-ID datasets and training on them?
Thank you!

About training process

Hi. Thanks for your great work!

Can I ask for an explanation of how to run the code?

If I want to reproduce the Market1501 result of your paper with the CNN baseline, do I first need to train the image encoder with the strong re-ID method, using the command below?

  • CUDA_VISIBLE_DEVICES=0 python train.py --config_file configs/person/cnn_base.yml

And if I have both the pretrained image encoder and text encoder, does the command below run stage-1 training to optimize the learnable tokens and also stage-2 training?

  • CUDA_VISIBLE_DEVICES=0 python train_clipreid.py --config_file configs/person/cnn_clipreid.yml

But where is the text encoder training? Is it automatically loaded in the code?

Also, how should I test the CNN-based model after training stage 2?
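A hedged guess at the evaluation command, mirroring the training commands above and the test_clipreid.py script mentioned in other issues (the weight path is a placeholder, and the command-line override assumes the test script accepts yacs-style opts like the training script does):

  • CUDA_VISIBLE_DEVICES=0 python test_clipreid.py --config_file configs/person/cnn_clipreid.yml TEST.WEIGHT 'path/to/your_stage2_checkpoint.pth'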

Thanks in advance.

ValueError: Type mismatch (<class 'yacs.config.CfgNode'> vs. <class 'NoneType'>) with values (NAMES: market1501 ROOT_DIR: ../data vs. None) for config key: DATASETS

Hello, I ran into some problems when running your work. As you can see, when I run ViT-based CLIP-ReID+SIE+OLP on Market1501, I get the error "ValueError: Type mismatch (<class 'yacs.config.CfgNode'> vs. <class 'NoneType'>) with values (NAMES: market1501 ROOT_DIR: ../data vs. None) for config key: DATASETS". I cannot figure it out; can you tell me how to solve it?
I only changed:
DATASETS:
NAMES: ('market1501')
ROOT_DIR: '../Market-1501-v15.09.15'
OUTPUT_DIR: '../market1501_out'
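For what it's worth, yacs reports DATASETS as NoneType when the keys under it are not nested beneath it in the YAML file, so the likely fix is indentation roughly like the following (an assumption; compare with the stock configs/person/vit_clipreid.yml, where OUTPUT_DIR is a top-level key):

  DATASETS:
    NAMES: ('market1501')
    ROOT_DIR: '../Market-1501-v15.09.15'
  OUTPUT_DIR: '../market1501_out'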

Thank you very much!

[Question] Unexpected performance drop with ViT-L/14?

I have been playing about with your CLIP ReID model and I appreciate the effectiveness of your approach.

Recently, I conducted an experiment on Market-1501 to investigate whether the model's performance can be further improved by using a larger architecture. Specifically, I replaced the ViT-B/16 backbone with ViT-L/14 (I changed the projection planes in make_clip_reid.py to make it work, etc.). Intuitively, one might expect a larger model to deliver better performance. However, the results were counterintuitive.
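For context, the following backbone dimensions come from the public CLIP model definitions rather than from this repository, so treat them only as a reference point when checking the projection changes:

  # Publicly documented CLIP vision-backbone dimensions (verify against the
  # clip package you actually load; listed here only as a reference point).
  clip_dims = {
      "ViT-B/16": {"transformer_width": 768,  "output_embed_dim": 512, "patch_size": 16},
      "ViT-L/14": {"transformer_width": 1024, "output_embed_dim": 768, "patch_size": 14},
  }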

Here are the results obtained with the original ViT-B/16:

mAP: 89.8%
CMC curve, Rank-1  :95.3%
CMC curve, Rank-5  :98.6%
CMC curve, Rank-10 :99.2%

And here are the results obtained with the ViT-L/14:

mAP: 79.1%
CMC curve, Rank-1  :90.7%
CMC curve, Rank-5  :96.7%
CMC curve, Rank-10 :98.1%

It appears that the performance with the ViT-L/14 architecture is significantly lower than with ViT-B/16. I double-checked the modifications and ensured that the experiment settings were identical, save for the architecture swap.

For reference, I'm attaching the training logs of both models:

train_log-market1501-V14.txt
train_log-market1501-B16.txt

I would greatly appreciate your insights into why the ViT-L/14 architecture might underperform ViT-B/16 in this context. I am new to using ViT models for ReID, so any guidance on how the model could be fine-tuned for the larger architecture would also be appreciated!

text encoder is not fixed in first stage training

As the paper describes, in the first stage the text and image encoders are fixed and only the text tokens are optimized. However, in the code, it seems the text encoder is optimized during training. Could I ask if I have misunderstood?
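For reference, what the paper describes for stage 1 would look roughly like the sketch below (the module name is an assumption, not this repository's exact code): only the per-identity text tokens receive gradients, while both encoders stay frozen.

  # Illustrative sketch only: freeze everything except the learnable prompt
  # tokens (the module name "prompt_learner" is an assumption).
  for name, param in model.named_parameters():
      param.requires_grad_("prompt_learner" in name)

  # Only the prompt-learner context parameters should remain trainable.
  print([n for n, p in model.named_parameters() if p.requires_grad])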

Thank you for your work! I have some questions I'd like to ask!

In the code, during stage-1 training the image encoder is frozen, while the learnable text tokens and the text encoder are both trainable. This doesn't match the paper, which says that only the text tokens are learnable and both the image encoder and text encoder are frozen.

Training CLIP-ReID on a Custom Dataset: Player Re-identification Challenge

Hello CLIP-ReID maintainers,

First off, I want to thank you all for creating and maintaining this incredible repository.

I'm writing this issue to seek guidance on a particular aspect of using CLIP-ReID: training the model on a custom dataset. The dataset I'm interested in is from the 'Player Re-identification Challenge' repository, which you can find here.

I've gone through the code, but I couldn't find specific instructions on how to use a custom dataset for training. I have been able to train CLIP-ReID with the Market1501 dataset with no problem.

pre-trained CLIP-ReID for evaluation when having no train data

Hi,

How can I use a pre-trained CLIP-ReID model to evaluate or extract features on a custom dataset when I don't have data to train?

I know that I need to change an existing configuration file such as vit_clipreid.yml to point to my evaluation data, but is there any example of how to run the evaluation script when there is no TEST.WEIGHT parameter?
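Something like the following is what I imagine, assuming the test script accepts yacs-style command-line overrides the way the training script does (the config path prefix, dataset root, and weight path are placeholders):

  • CUDA_VISIBLE_DEVICES=0 python test_clipreid.py --config_file configs/person/vit_clipreid.yml DATASETS.ROOT_DIR '../my_eval_data' TEST.WEIGHT 'path/to/pretrained_clipreid.pth'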

Thanks

Can't load the weights of VehicleID

Hi, I am unable to load the weights of the VehicleID model. Can you help me solve it?

Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([257, 768])
Position embedding resize to height:16 width: 16
Traceback (most recent call last):
  File "test_clipreid.py", line 42, in <module>
    model.load_param(cfg.TEST.WEIGHT)
  File "/Users/shreejaltrivedi/Documents/Repos/CLIP-ReID/model/make_model_clipreid.py", line 159, in load_param
    self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
RuntimeError: The size of tensor a (576) must match the size of tensor b (13164) at non-singleton dimension 0

Interesting work! Can I use your pre-trained model for my method?

I currently have a Prompt-CLIP work, which is a similar idea to yours, but at the moment, I've only experimented with CLIP-CNN, which is also proven to work. I've created my pseudo-text prompts for each identity in the six datasets. I am very inspired by the experimental results in your paper and would like to use your model for fine-tuning. I will be citing your paper in the future!

Problem running the code

Traceback (most recent call last):
  File "/media/lele/c/zuozhigang/CLIP_ReID/Base/train_clipreid.py", line 89, in
    do_train_stage2(
  File "/media/lele/c/zuozhigang/CLIP_ReID/Base/processor/processor_clipreid_stage2.py", line 98, in do_train_stage2
    loss = loss_fn(score, feat, target, target_cam, logits)
TypeError: loss_func() takes 3 positional arguments but 5 were given

Has anyone else run into this problem, and if so, how did you solve it?
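A hedged debugging step rather than a confirmed fix: the stage-2 trainer calls the loss with five arguments, so it is worth printing the signature of the loss function your config actually built to see which variant was selected.

  import inspect

  # Illustrative: the stage-2 trainer calls loss_fn(score, feat, target,
  # target_cam, logits); inspect the function that was built to see whether
  # your config produced a three-argument variant instead.
  print(inspect.signature(loss_fn))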

About Fig3 in the paper

Hi. Thank you for sharing your work.

How did you visualize Figure 3?

Could you also provide the code for it?

Duke dataset

Has the paper been accepted? Since it uses the Duke dataset, will you be asked to remove the experiments on that dataset?

About distributed training

Does this code support distributed training? If so, how should it be run?
