syliz517 / clip-reid

Official implementation for "CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels" (AAAI 2023)

License: MIT License

Python 100.00%

clip reid

clip-reid's Issues

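Hello, and sorry to bother you. The first line of the PromptLearner forward function is cls_ctx = self.cls_ctx[label], and I don't quite understand it. Taking Market1501 as an example, during training self.cls_ctx is a tensor of shape (751, 4, 512). With a batch size of 64, this line picks out the entries of self.cls_ctx for the corresponding labels. Because self.cls_ctx keeps being updated during first-stage training, its rows end up corresponding to the 751 individuals; in other words, the final self.cls_ctx amounts to one prompt vector of shape (4, 512) per individual. At inference time, however, there are 750 new people, so how does this generalize?

Sorry if this is a silly question. I read the CoCoOp paper and source code and still could not figure it out, and the task there is also somewhat different. I would appreciate your guidance. Best wishes.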
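The indexing described in the question can be illustrated without PyTorch. This is only a minimal sketch: the table is shrunk from (751, 4, 512) to (5, 2, 3), the names cls_ctx and label mirror the question, and gather_ctx is a hypothetical helper, not a function from the repository.

```python
# Toy stand-in for self.cls_ctx: one (n_tokens, dim) block per training identity.
# Sizes are shrunk from (751, 4, 512) to (5, 2, 3) purely for illustration.
NUM_IDS, N_TOKENS, DIM = 5, 2, 3

cls_ctx = [
    [[float(pid)] * DIM for _ in range(N_TOKENS)]
    for pid in range(NUM_IDS)
]


def gather_ctx(table, labels):
    """Pure-Python equivalent of table[labels] for a batch of integer labels."""
    return [table[pid] for pid in labels]


batch_labels = [3, 0, 3, 4]  # identity label of each image in the batch
batch_ctx = gather_ctx(cls_ctx, batch_labels)

# The result has shape (batch, n_tokens, dim); rows with the same label share tokens.
shape = (len(batch_ctx), len(batch_ctx[0]), len(batch_ctx[0][0]))
print(shape)  # (4, 2, 3)
```

Because the table is indexed by training identity, it has no row for an identity that was never seen in training, which is exactly the point the question raises.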

Will using ArcFace loss perform better than CE?

People tracking in videos?

Is it possible to use CLIP-ReID with YOLOv8n for people tracking in videos, or does it work only with datasets? I looked through your code and didn't find anything for working with videos.

Request for attention visualization

Hello, I appreciate your kind words about the excellent results and research sharing.

Regarding the visualization of CLIP-ReID mentioned in the Ablation Studies and Analysis section of the paper, which cites Chefer, H.; Gur, S.; and Wolf, L. 2021, "Transformer interpretability beyond attention visualization", in the Proceedings of the IEEE/CVF CVPR, pages 782–791:

I would like to visualize my training results similar to what you did in Figure 3 by referring to your paper. Could I please get access to the code you used for visualization?

Training with .CSV Files: Image Paths and Text Descriptions

Hi there,

is there a way to train with a .csv file that has image paths and text descriptions?

RuntimeError: The size of tensor a (152) must match the size of tensor b (160) at non-singleton dimension 0

I use this checkpoint: VeRi_clipreid_12x12sie_ViT-B-16_60.pth, and I get this error. I changed the stride size from [16, 16] to [12, 12], but the problem remains. Is there anything else I need to change?

About the reid model of the vehicle

First of all, thank you very much for your contribution in the field of re-identification!
I had some problems when using your model. When I loaded the vehicle model, some tensor size mismatch errors were displayed. When training on vehicle data, what did you do to the ViT backbone? What modifications were made?

Custom Dataset

Hi! Thank you for your work. Do you have any guidance on how to train the model on a custom dataset? Thanks.

Failure to replicate results?

I used the VeRi dataset to train and evaluate. The results are as follows:

2023-10-24 21:53:58,401 transreid.test INFO: Validation Results
2023-10-24 21:53:58,402 transreid.test INFO: mAP: 75.5%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-1 :92.0%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-5 :94.4%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-10 :95.9%

Rank-1 is 4.8% lower than in the paper. I double-checked the configs and ensured that the experiment settings were identical.

For reference, I'm attaching the training logs of the models:

train_log_cnn_prom_veri.txt
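For readers comparing numbers like the ones in this log, the sketch below shows how Rank-k (CMC) and mAP are typically computed from a query-to-gallery distance matrix. It is a simplified illustration, not the repository's evaluation code: it skips the same-camera filtering that standard ReID protocols apply, and the function names and toy data are hypothetical.

```python
def rank_k_and_ap(distances, gallery_ids, query_id, ks=(1, 5, 10)):
    """Return ({k: hit}, average_precision) for a single query.

    distances[i] is the distance from the query to gallery item i.
    Simplification: no same-camera filtering is applied here.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    matches = [gallery_ids[i] == query_id for i in order]

    # Rank-k hit: at least one correct match among the k nearest gallery items.
    hits = {k: any(matches[:k]) for k in ks}

    # Average precision: mean of the precision values at each correct match.
    num_correct, precision_sum = 0, 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            num_correct += 1
            precision_sum += num_correct / rank
    ap = precision_sum / num_correct if num_correct else 0.0
    return hits, ap


def evaluate(dist_matrix, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Average the rank-k hits and AP over all queries (CMC and mAP)."""
    cmc = {k: 0.0 for k in ks}
    ap_sum = 0.0
    for row, qid in zip(dist_matrix, query_ids):
        hits, ap = rank_k_and_ap(row, gallery_ids, qid, ks)
        for k in ks:
            cmc[k] += hits[k]
        ap_sum += ap
    n = len(query_ids)
    return {k: v / n for k, v in cmc.items()}, ap_sum / n


# Toy data: two queries against a gallery of four images.
gallery_ids = [1, 2, 1, 3]
query_ids = [1, 2]
dist_matrix = [
    [0.9, 0.1, 0.2, 0.5],   # query with identity 1
    [0.3, 0.05, 0.4, 0.6],  # query with identity 2
]
cmc, mean_ap = evaluate(dist_matrix, query_ids, gallery_ids)
print(cmc[1], cmc[5], mean_ap)  # 0.5 1.0 0.75
```

Small differences in the protocol (for example, which gallery images are excluded per query) can shift these metrics, so it is worth confirming that the evaluation settings match before comparing against published numbers.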

Missing Comparisons

Thanks for releasing this code for CLIP-based Re-ID. It is a good attempt at improving Re-ID. However, I have some key concerns:

1. Overclaims
Your work is not the first to adopt CLIP for Re-ID. Please see the following paper from MMSports '22 (October 10, 2022):
Konrad Habel et al., CLIP-ReIdent: Contrastive Training for Player Re-Identification
In addition, I think the related work section is incomplete. Many other Transformer-based methods should be discussed. For example, HAT (HAT: Hierarchical Aggregation Transformers for Person Re-identification) already uses multi-level supervision (similarly highlighted in the last sentence of your training details), and LAFomer uses a local-aware transformer for re-identification.
It would be better for the authors to revise these parts.

2. Missing key comparisons
I appreciate that the authors provide ablations. However, what is the effect of using multi-level supervision ("Note that we also employ Ltri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50.")? This kind of supervision generally gives better results than supervising only the last layer, which may lead to unfair comparisons.

Since training requires feeding all images, what are the training time and test speed on your devices? These are also not listed.

Will L2 normalization of image and text features lead to better results?

When aligning image and text, why is there no need to L2-normalize the image and text features? Won't this cause the norm of the image features to grow very large in order to reduce the i2t loss in the second stage of training?
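The concern in this question can be made concrete with a small numeric sketch: a raw dot product grows when a feature is scaled up, while the cosine similarity of L2-normalized features does not. The vectors and helper names below are made up for illustration, and the snippet says nothing about which variant the repository actually uses.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v, eps=1e-12):
    """Scale v to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / max(norm, eps) for x in v]


image_feat = [3.0, 4.0]  # hypothetical image feature
text_feat = [1.0, 0.0]   # hypothetical text feature
scaled_image = [10 * x for x in image_feat]

# Raw similarity depends on the feature norm.
raw = dot(image_feat, text_feat)
raw_scaled = dot(scaled_image, text_feat)

# Cosine similarity is unchanged by rescaling the image feature.
cos = dot(l2_normalize(image_feat), l2_normalize(text_feat))
cos_scaled = dot(l2_normalize(scaled_image), l2_normalize(text_feat))

print(raw, raw_scaled)  # 3.0 30.0
print(round(cos, 6), round(cos_scaled, 6))  # 0.6 0.6
```

With unnormalized features, a loss built on the raw score can in principle be lowered by inflating feature norms alone, which is the effect the question asks about.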

Issue in evaluating the models

Hey, thanks for this excellent work of yours.
I have trained a model on a custom dataset. When I try to load the model for evaluation, the script raises the following error:

  Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])
  Position embedding resize to height:16 width: 8
  Traceback (most recent call last):
  File "test_clipreid.py", line 44, in <module>
  model.load_param_finetune(cfg.TEST.WEIGHT)
  File "/app/model/make_model_clipreid.py", line 173, in load_param_finetune
  self.state_dict()[i].copy_(param_dict[i])
  RuntimeError: The size of tensor a (129) must match the size of tensor b (211) at non-singleton dimension 0

During evaluation the log shows "Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])", whereas during training it shows "Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([211, 768])". Could you please check this part?

The same happens when I try to load the VeRi fine-tuned model and the Market1501 model with the scripts you have provided.
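The token counts in these logs can be reproduced with the usual sliding-window output-size formula. The sketch below assumes a 16x16 patch and a 256x128 input; neither is stated in the issue, although the "height:16 width: 8" line in the log is consistent with them. num_tokens is a hypothetical helper, not a function from the repository.

```python
def num_tokens(image_hw, patch=16, stride=16):
    """Return (rows, cols, tokens) for a sliding-window patch embedding.

    tokens counts all patches plus one class token.
    """
    h, w = image_hw
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    return rows, cols, rows * cols + 1


print(num_tokens((224, 224)))             # (14, 14, 197)
print(num_tokens((256, 128), stride=16))  # (16, 8, 129)
print(num_tokens((256, 128), stride=12))  # (21, 10, 211)
print(num_tokens((256, 256), stride=16))  # (16, 16, 257)
```

Under these assumptions, 129 tokens corresponds to a stride of 16 and 211 tokens to a stride of 12, which would suggest that the evaluation config uses a different stride size than the one used for training.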

Error when training

Hi!
I get the following error:

As far as I understand, this error is not related to the custom dataset, but to the model architecture itself.

Is it possible to apply CLIP-ReID to visible-infrared person re-identification?

Fine-tune on new small dataset

Hi, I have a new small person re-ID dataset (~100 IDs), and I want to fine-tune your models.
Should I fine-tune both stages, or just fine-tune for some epochs on stage 2?
Have you tried merging all re-ID datasets and training on them?
Thank you!

About training process

Hi. Thanks for your great work!

Can I ask about the explanation of the code execution?

If I want to reproduce the Market1501 result of your paper with CNN baseline,

do I first need to train the image encoder with the strong re-ID method, using the command below?

CUDA_VISIBLE_DEVICES=0 python train.py --config_file configs/person/cnn_base.yml

And if I have both a pretrained image encoder and text encoder, does the command below run stage 1 training to optimize the learnable tokens
and also stage 2 training?

CUDA_VISIBLE_DEVICES=0 python train_clipreid.py --config_file configs/person/cnn_clipreid.yml

But where is the text encoder training? Is it automatically loaded in the code?

Also, how should I test the CNN-based model after stage 2 training?

Thanks in advance.

ValueError: Type mismatch (<class 'yacs.config.CfgNode'> vs. <class 'NoneType'>) with values (NAMES: market1501 ROOT_DIR: ../data vs. None) for config key: DATASETS

Hello, I ran into a problem when running your work. When I run ViT-based CLIP-ReID+SIE+OLP for Market1501, I get the error "ValueError: Type mismatch (<class 'yacs.config.CfgNode'> vs. <class 'NoneType'>) with values (NAMES: market1501 ROOT_DIR: ../data vs. None) for config key: DATASETS". I cannot figure it out. Can you tell me how to solve it?
I only changed:
DATASETS:
NAMES: ('market1501')
ROOT_DIR: '../Market-1501-v15.09.15'
OUTPUT_DIR: '../market1501_out'

Thank you very much!
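One thing worth checking here: the error says the new value for DATASETS is None, and a YAML key with no value and no indented children parses to null. The snippet in the issue is shown without indentation, though it is unknown whether that reflects the actual file or only how the page rendered it. The sketch below is a heuristic line scan (not a YAML parser, and find_empty_sections is a hypothetical helper) that flags such keys.

```python
def find_empty_sections(yaml_text):
    """Return top-level keys that have no value and no indented children.

    YAML parses such a key as null (None in Python), which is one way to get a
    "CfgNode vs NoneType" mismatch when merging a file into a yacs config.
    This is a heuristic line scan, not a YAML parser.
    """
    lines = [
        ln for ln in yaml_text.splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    empty = []
    for i, line in enumerate(lines):
        is_top_level = not line[0].isspace()
        if is_top_level and line.rstrip().endswith(":"):
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            if not nxt or not nxt[0].isspace():
                empty.append(line.rstrip()[:-1])
    return empty


# As shown in the issue: NAMES and ROOT_DIR are not nested under DATASETS.
flat = """DATASETS:
NAMES: ('market1501')
ROOT_DIR: '../Market-1501-v15.09.15'
OUTPUT_DIR: '../market1501_out'
"""

# Same content with the two dataset keys indented under DATASETS.
nested = """DATASETS:
  NAMES: ('market1501')
  ROOT_DIR: '../Market-1501-v15.09.15'
OUTPUT_DIR: '../market1501_out'
"""

print(find_empty_sections(flat))    # ['DATASETS']
print(find_empty_sections(nested))  # []
```

If the real config file looks like the flat version, indenting NAMES and ROOT_DIR under DATASETS is a likely fix.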

[Question] Unexpected Performance Drop with ViT/L14?

I have been playing about with your CLIP ReID model and I appreciate the effectiveness of your approach.

Recently, I conducted an experiment on Market-1501 to investigate whether we can further improve the performance of the model by using a larger model architecture. Specifically, I replaced the ViT-B16 backbone in the model with ViT/L14 (I changed the projection planes in make_clip_reid.py to make it work etc.). Intuitively, one might expect that a larger model would deliver better performance. However, the results were counterintuitive.

Here are the results obtained with the original ViT-B/16:

mAP: 89.8%
CMC curve, Rank-1  :95.3%
CMC curve, Rank-5  :98.6%
CMC curve, Rank-10 :99.2%

And here are the results obtained with the ViT-L/14:

mAP: 79.1%
CMC curve, Rank-1  :90.7%
CMC curve, Rank-5  :96.7%
CMC curve, Rank-10 :98.1%

It appears that the performance with the ViT/L14 architecture is significantly lower than with the ViT-B16. I double-checked the modifications and ensured that the experiment settings were identical, save for the architecture swap.

For reference, I'm attaching the training logs of both models:

train_log-market1501-V14.txt
train_log-market1501-B16.txt

I would greatly appreciate your insights into why the ViT/L14 architecture might underperform compared to ViT-B16 in this context. I am new to using ViT models in ReID so any guidance on how the model could potentially be fine-tuned for the larger architecture would also be appreciated!

text encoder is not fixed in first stage training

As the paper describes, in the first stage the text and image encoders are fixed and only the text tokens are optimized. However, in the code, it seems the text encoder is optimized during training. Am I misunderstanding something?

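Thank you for your work! I have a question.

In the code, during first-stage training the image encoder is frozen, while the learnable text tokens and the text encoder are trainable. This does not match the paper, which says that only the text tokens are learnable and that the image encoder and text encoder are frozen.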

Training CLIP-ReID on a Custom Dataset: Player Re-identification Challenge

Hello CLIP-ReID maintainers,

First off, I want to thank you all for creating and maintaining this incredible repository.

I'm writing this issue to seek guidance on a particular aspect of using CLIP-ReID: training the model on a custom dataset. The dataset I'm interested in is from the 'Player Re-identification Challenge' repository, which you can find here.

I've gone through the code, but I couldn't find specific instructions on how to use a custom dataset for training. I have been able to train CLIP-ReID with the Market1501 dataset with no problem.

pre-trained CLIP-ReID for evaluation when having no train data

Hi,

How can I use a pre-trained CLIP-ReID model to evaluate or extract features on a custom dataset when I don't have data to train?

I know that I need to change an existing configuration file like vit_clipreid.yml and point it to my eval data, but is there any example of how to run the evaluation script when there is no TEST.WEIGHT parameter?

Thanks

Thanks for your great work! Suppose I want to add visual prompt learning on top of your model. Will the model perform better?

The idea is from this paper. Can you give me some guidance? Thanks again!

Compared to fast-reid

Thanks for releasing the CLIP-based Re-ID code. I'm working on person re-ID and have followed the code in https://github.com/JDAI-CV/fast-reid/ . Comparing with the results in https://github.com/JDAI-CV/fast-reid/blob/master/MODEL_ZOO.md, it looks like the results of CLIP-ReID don't outperform the CNN-based baselines?

Is it still possible to use the text encoder to generate a vector from a sentence?

I would like to ask whether there is a way to access the text encoder for this model, or whether I can simply use the encoder directly from CLIP with the same model type. By the way, I am sorry for creating an issue; I found it quite confusing to locate your text encoder in the code.

Can't load the weights of VehicleID

Hi, I am unable to load the weights of the VehicleID model. Can you help me solve it?

Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([257, 768])
Position embedding resize to height:16 width: 16
Traceback (most recent call last):
  File "test_clipreid.py", line 42, in <module>
    model.load_param(cfg.TEST.WEIGHT)
  File "/Users/shreejaltrivedi/Documents/Repos/CLIP-ReID/model/make_model_clipreid.py", line 159, in load_param
    self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
RuntimeError: The size of tensor a (576) must match the size of tensor b (13164) at non-singleton dimension 0
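To find out which parameter triggers an error like this, it can help to compare shapes key by key before copying. The sketch below works on plain name-to-shape dictionaries; the parameter names and shapes are hypothetical, chosen only to mirror the 576 vs 13164 mismatch. One plausible reading, which the issue does not confirm, is that the mismatched tensor has one row per training identity, so the model was built for a dataset with a different identity count than the checkpoint.

```python
def shape_mismatches(model_shapes, ckpt_shapes):
    """List (name, model_shape, ckpt_shape) for keys whose shapes differ.

    Both arguments map parameter names to shape tuples, e.g. built with
    {k: tuple(v.shape) for k, v in state_dict.items()}.
    A 'module.' prefix on checkpoint keys is stripped, as in the traceback.
    """
    out = []
    for key, ckpt_shape in ckpt_shapes.items():
        name = key.replace("module.", "")
        if name in model_shapes and model_shapes[name] != ckpt_shape:
            out.append((name, model_shapes[name], ckpt_shape))
    return out


# Hypothetical parameter names and shapes, for illustration only.
model_shapes = {"classifier.weight": (576, 768), "base.proj": (768, 512)}
ckpt_shapes = {
    "module.classifier.weight": (13164, 768),
    "module.base.proj": (768, 512),
}

for name, have, want in shape_mismatches(model_shapes, ckpt_shapes):
    print(f"{name}: model {have} vs checkpoint {want}")
```

If the reported key turns out to be tied to the number of identities, checking that the dataset name in the config matches the dataset the checkpoint was trained on would be a natural next step.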

How to use CLIP-ReID as feature extractor?

Hi! First of all, I would like to know whether you are okay with me implementing these models here: https://github.com/mikel-brostrom/yolo_tracking. I would also like to know if there is an easy way of extracting features with these models. Keep up the great work!

Interesting work! Can I use your pre-trained model for my method?

I currently have a Prompt-CLIP work, which is a similar idea to yours, but at the moment, I've only experimented with CLIP-CNN, which is also proven to work. I've created my pseudo-text prompts for each identity in the six datasets. I am very inspired by the experimental results in your paper and would like to use your model for fine-tuning. I will be citing your paper in the future!

Center loss being ignored

I have noticed that you do not use center loss, despite setting up an optimizer for it and adding centroid scaling.

CLIP-ReID/processor/processor_clipreid_stage2.py

Line 13 in 5b92124

def do_train_stage2(cfg,

It's just not being called with the loss function.
Can I ask why? I have implemented it myself, but I have yet to see how well it works.

Problem running the code

Traceback (most recent call last):
  File "/media/lele/c/zuozhigang/CLIP_ReID/Base/train_clipreid.py", line 89, in
    do_train_stage2(
  File "/media/lele/c/zuozhigang/CLIP_ReID/Base/processor/processor_clipreid_stage2.py", line 98, in do_train_stage2
    loss = loss_fn(score, feat, target, target_cam, logits)
TypeError: loss_func() takes 3 positional arguments but 5 were given

Have any of you run into this problem? If so, how did you solve it?

About Fig3 in the paper

Hi. Thank you for sharing your work.

How did you visualize Figure 3? Did you

Could you also provide the code for it?
