syliz517 / clip-reid
Official implementation for "CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels" (AAAI 2023)
License: MIT License
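Hello, and sorry to bother you. The first statement in the forward function of PromptLearner is cls_ctx = self.cls_ctx[label], and I don't quite understand it. Taking Market1501 as an example, during training self.cls_ctx is a tensor of shape (751, 4, 512). With a batch size of 64, this statement picks out the entries of self.cls_ctx that correspond to the labels in the batch. Because self.cls_ctx keeps being updated during the first training stage, its entries end up corresponding to the 751 training identities, so each identity effectively gets its own prompt tensor of shape (4, 512). At inference time, however, there are 750 new identities, so how does this generalize?
Sorry if the question is a silly one. I read the CoCoOp paper and source code and still could not work it out, and the task there is rather different. I would appreciate your guidance. Best wishes.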
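The indexing step described in this question can be illustrated with a small sketch. It uses plain Python lists as a stand-in for the learnable tensor, so it only shows the shapes involved and is not the repository's implementation. The shape (751, 4, 512) and the batch size of 64 are the ones quoted in the question; the variable names and the random labels are assumptions made for illustration.

```python
import random

NUM_IDS, N_CTX, DIM = 751, 4, 512  # shape quoted in the question
BATCH_SIZE = 64

# Stand-in for self.cls_ctx: one (N_CTX, DIM) block of tokens per training identity.
cls_ctx = [[[0.0] * DIM for _ in range(N_CTX)] for _ in range(NUM_IDS)]

# A batch of identity labels, as a data loader would supply them (random here).
random.seed(0)
labels = [random.randrange(NUM_IDS) for _ in range(BATCH_SIZE)]

# List-based equivalent of cls_ctx = self.cls_ctx[label]:
# gather one token block per sample in the batch.
batch_ctx = [cls_ctx[label] for label in labels]

shape = (len(batch_ctx), len(batch_ctx[0]), len(batch_ctx[0][0]))
print(shape)  # (64, 4, 512)
```

Each sample receives the token block of its own identity, so only the blocks for identities present in the batch are involved in a given step. The sketch does not address the generalization question itself.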
Is it possible to use CLIP-ReID with YOLOv8n for people tracking in videos, or does it only work with datasets? I looked through your code and didn't find anything for working with videos.
Hello, thank you for the excellent results and for sharing your research.
My question concerns the visualization of CLIP-ReID in the Ablation Studies and Analysis section of the paper, which cites Chefer, H.; Gur, S.; and Wolf, L. 2021, "Transformer interpretability beyond attention visualization", in the Proceedings of the IEEE/CVF CVPR, pages 782–791.
I would like to visualize my own training results in the same way as Figure 3 of your paper. Could I get access to the code you used for the visualization?
Hi there,
is there a way to train with a .csv file that has image paths and text descriptions?
First of all, thank you very much for your contribution in the field of re-identification!
I ran into some problems when using your model. When I load the vehicle model, tensor size mismatch errors are reported. What needs to be done to the ViT backbone when training on vehicle data? What modifications were made?
Hi! Thank you for your work. Do you have any guidance on how to train the model on a custom dataset? Thanks.
I used the VeRi dataset to train and evaluate. The results are as follows:
2023-10-24 21:53:58,401 transreid.test INFO: Validation Results
2023-10-24 21:53:58,402 transreid.test INFO: mAP: 75.5%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-1 :92.0%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-5 :94.4%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-10 :95.9%
Rank-1 is 4.8% lower than in the paper. I double-checked the configs and ensured that the experiment settings were identical.
For reference, I'm attaching the training logs of the models:
Thanks for releasing this code for CLIP-based Re-ID. This is a good attempt at improving Re-ID. However, I have some key concerns:
2. Missing key comparisons
I appreciate that the authors provide ablations. However, what is the effect of the multi-level supervision ("Note that we also employ Ltri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50.")? In fact, this kind of supervision generally gives better results than supervising only the last layer, which may lead to unfair comparisons.
Since training requires feeding in all images, what are the training time and test speed on your devices? These are also not listed.
When aligning image and text, why is there no need to L2-normalize the image and text features? Won't this cause the norm of the image features to grow very large in order to reduce the i2t loss in the second stage of training?
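The concern raised here can be made concrete with a toy calculation. The sketch below is an assumption-laden illustration, not the repository's loss: it uses a plain softmax cross-entropy over dot-product logits as a stand-in for an image-to-text loss, with made-up two-dimensional features and hypothetical helper names. It shows that, without normalization, scaling up an image feature that already points at the right text feature lowers the loss, while L2 normalization removes that effect.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def i2t_loss(image_feat, text_feats, target):
    """Softmax cross-entropy over image-to-text dot products (toy stand-in)."""
    logits = [dot(image_feat, t) for t in text_feats]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

text_feats = [[1.0, 0.0], [0.0, 1.0]]   # two toy text features
image_feat = [0.8, 0.6]                 # closest to text feature 0
scaled = [10 * x for x in image_feat]   # same direction, 10x the norm

raw_small = i2t_loss(image_feat, text_feats, 0)
raw_large = i2t_loss(scaled, text_feats, 0)
norm_small = i2t_loss(l2_normalize(image_feat), text_feats, 0)
norm_large = i2t_loss(l2_normalize(scaled), text_feats, 0)

print(raw_large < raw_small)                 # scaling alone reduces the raw loss
print(abs(norm_small - norm_large) < 1e-9)   # normalized loss ignores the scale
```

Whether this effect matters in practice depends on the rest of the training objective, which the sketch does not model.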
Hey, thanks for this excellent work of yours.
I have trained a model on a custom dataset. When I try to load the model for evaluation, the script raises the following error:
Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])
Position embedding resize to height:16 width: 8
Traceback (most recent call last):
File "test_clipreid.py", line 44, in <module>
model.load_param_finetune(cfg.TEST.WEIGHT)
File "/app/model/make_model_clipreid.py", line 173, in load_param_finetune
self.state_dict()[i].copy_(param_dict[i])
RuntimeError: The size of tensor a (129) must match the size of tensor b (211) at non-singleton dimension 0
During evaluation the log reads: Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])
During training, however, it reads: Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([211, 768])
Could you please check this part?
The same happens when I try to load the VeRi fine-tuned model and the Market1501 model with the scripts you have provided.
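The token counts in these logs can be reproduced with a short calculation. The sketch below assumes the usual sliding-window formula for a ViT patch grid, (size - patch) // stride + 1 per axis, plus one class token. The function name, the 256x128 input size, and the stride of 12 are assumptions chosen because they match the logged numbers; they are not values confirmed by the report.

```python
def num_position_tokens(height, width, patch=16, stride=16):
    """Number of patch positions plus one class token for a strided ViT patch grid."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows * cols + 1

print(num_position_tokens(224, 224))             # 197 (14 x 14 grid, pretrained size)
print(num_position_tokens(256, 128))             # 129 (16 x 8 grid, as logged at test time)
print(num_position_tokens(256, 128, stride=12))  # 211 (21 x 10 grid, as logged at training time)
```

If this reading is right, the checkpoint was trained with a 21x10 patch grid and is being loaded into a model built with a 16x8 grid. Comparing the stride setting (MODEL.STRIDE_SIZE in this code base's configs) between the training and test runs would be a reasonable first check.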
Hi, I have a new, small person re-ID dataset (~100 IDs), and I want to fine-tune your models on it.
Should I fine-tune in both stages, or just fine-tune for a few epochs in stage 2?
Have you tried merging all the re-ID datasets and training on them?
Thank you!
Hi. Thanks for your great work!
May I ask how the code is meant to be run?
If I want to reproduce the Market1501 result of your paper with the CNN baseline,
do I first need to train the image encoder with the Strong re-ID method, using the code below?
And if I have both a pretrained image encoder and a text encoder, does the code below run stage 1 training to optimize the learnable tokens
and also stage 2 training?
But where is the text encoder trained? Is it loaded automatically in the code?
Also, how should I test the CNN-based model after stage 2 training?
Thanks in advance.
Hello, I ran into a problem when running your code. When I run ViT-based CLIP-ReID+SIE+OLP for market1501, I get the error "ValueError: Type mismatch (<class 'yacs.config.CfgNode'> vs. <class 'NoneType'>) with values (NAMES: market1501 ROOT_DIR: ../data vs. None) for config key: DATASETS". I cannot figure it out. Can you tell me how to solve it?
I only changed:
DATASETS:
NAMES: ('market1501')
ROOT_DIR: '../Market-1501-v15.09.15'
OUTPUT_DIR: '../market1501_out'
Thank you very much!
I have been playing about with your CLIP ReID model and I appreciate the effectiveness of your approach.
Recently, I ran an experiment on Market-1501 to investigate whether a larger model architecture can further improve performance. Specifically, I replaced the ViT-B/16 backbone with ViT-L/14 (changing the projection planes in make_clip_reid.py, among other things, to make it work). Intuitively, one might expect a larger model to perform better. However, the results were counterintuitive.
Here are the results obtained with the original ViT-B/16:
mAP: 89.8%
CMC curve, Rank-1 :95.3%
CMC curve, Rank-5 :98.6%
CMC curve, Rank-10 :99.2%
And here are the results obtained with the ViT-L/14:
mAP: 79.1%
CMC curve, Rank-1 :90.7%
CMC curve, Rank-5 :96.7%
CMC curve, Rank-10 :98.1%
It appears that performance with the ViT-L/14 architecture is significantly lower than with ViT-B/16. I double-checked the modifications and ensured that the experiment settings were identical, save for the architecture swap.
For reference, I'm attaching the training logs of both models:
train_log-market1501-V14.txt
train_log-market1501-B16.txt
I would greatly appreciate your insights into why the ViT-L/14 architecture might underperform ViT-B/16 in this context. I am new to using ViT models in ReID, so any guidance on how the model could be fine-tuned for the larger architecture would also be appreciated!
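As the paper describes, in the first stage the text and image encoders are fixed and only the text tokens are optimized. However, in the code it seems that the text encoder is also optimized during training. Could I ask whether I have misunderstood?
In the code, during first-stage training the image encoder is frozen, while the learnable text tokens and the text encoder are both trainable. This does not match the paper's description, in which only the text tokens are learnable and both the image encoder and the text encoder are frozen.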
Hello CLIP-ReID maintainers,
First off, I want to thank you all for creating and maintaining this incredible repository.
I'm writing this issue to seek guidance on a particular aspect of using CLIP-ReID: training the model on a custom dataset. The dataset I'm interested in is from the 'Player Re-identification Challenge' repository, which you can find here.
I've gone through the code, but I couldn't find specific instructions on how to use a custom dataset for training. I have been able to train CLIP-ReID with the Market1501 dataset with no problem.
Hi,
How can I use a pre-trained CLIP-ReID model to evaluate or extract features on a custom dataset when I don't have data to train?
I know that I need to change an existing configuration file such as vit_clipreid.yml and point it to my eval data, but is there any example of how to run the evaluation script when there is no TEST.WEIGHT parameter?
Thanks
idea from this paper. And can you give me some guidance... thanks again!
Thanks for releasing CLIP-based Re-ID code. I'm doing work related to person re-ID and have been following the code in https://github.com/JDAI-CV/fast-reid/ . Compared with the results in https://github.com/JDAI-CV/fast-reid/blob/master/MODEL_ZOO.md, it looks like the results of CLIP-ReID don't outperform those of the CNN-based baselines?
I would like to ask whether there is a way to access the text encoder for this model, or whether I can simply use the encoder directly from CLIP with the same model type. By the way, I am sorry for creating an issue for this; I found it quite confusing to locate your text encoder in the code.
Hi, I am unable to load the weights of the VehicleID model. Can you help me solve this?
Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([257, 768])
Position embedding resize to height:16 width: 16
Traceback (most recent call last):
File "test_clipreid.py", line 42, in <module>
model.load_param(cfg.TEST.WEIGHT)
File "/Users/shreejaltrivedi/Documents/Repos/CLIP-ReID/model/make_model_clipreid.py", line 159, in load_param
self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
RuntimeError: The size of tensor a (576) must match the size of tensor b (13164) at non-singleton dimension 0
Hi! First of all, I would like to know whether you are okay with me implementing these models here: https://github.com/mikel-brostrom/yolo_tracking. I would also like to know if there is an easy way of extracting features with these models. Keep up the great work!
I am currently working on a Prompt-CLIP project, which is similar in idea to yours, but so far I have only experimented with CLIP-CNN, which has also proven to work. I've created my pseudo-text prompts for each identity in the six datasets. I am very inspired by the experimental results in your paper and would like to use your model for fine-tuning. I will be citing your paper in the future!
I have noticed that you do not use center loss, despite setting up an optimizer for it and adding centroid scaling.
Traceback (most recent call last):
File "/media/lele/c/zuozhigang/CLIP_ReID/Base/train_clipreid.py", line 89, in
do_train_stage2(
File "/media/lele/c/zuozhigang/CLIP_ReID/Base/processor/processor_clipreid_stage2.py", line 98, in do_train_stage2
loss = loss_fn(score, feat, target, target_cam, logits)
TypeError: loss_func() takes 3 positional arguments but 5 were given
Has anyone run into this problem? If so, how did you solve it?
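This kind of TypeError appears whenever the callable bound to loss_fn accepts fewer positional parameters than the call site passes. A minimal reproduction follows, using a hypothetical three-parameter stand-in rather than the repository's actual loss function; only the five argument names are taken from the traceback.

```python
def loss_func(score, feat, target):
    """Hypothetical three-parameter loss, standing in for whichever variant was built."""
    return 0.0

try:
    # The stage-2 call site in the traceback passes five positional arguments.
    loss_func("score", "feat", "target", "target_cam", "logits")
    message = ""
except TypeError as err:
    message = str(err)

print(message)  # loss_func() takes 3 positional arguments but 5 were given
```

If the loss is constructed differently depending on the configuration, comparing the config in use against the provided defaults would be a reasonable first step.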
Hi. Thank you for sharing your work.
How did you visualize Figure 3? Did you
Could you also provide the code for it?
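Has the paper been accepted? Since it includes the Duke dataset, might you be asked to remove the experiments on that dataset?
Does this code support distributed training? If so, how should it be run?
Why is the triplet loss applied to img_feature_last? Here, img_feature_last is the output of the second-to-last module of the ViT model.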