
jointnlt's Introduction

  • 👋 Hi, I’m Li Zhou
  • 👀 I’m interested in CV; I studied at HIT (Shenzhen)
  • 🌱 I’m currently learning tracking and multi-modality learning
  • 💞️ I’m looking to collaborate on multi-modality learning, SOT, and NLT
  • 📫 How to reach me: [email protected]

jointnlt's People

Contributors

lizhou-cs


jointnlt's Issues

How do I prepare the OTB dataset?

Dear Zhou,

Thanks for your excellent work!

How do I prepare the OTB dataset?
How can I download OTB_query_test, OTB_query_train, and OTB_videos? The expected layout is:

-- OTB_sentences
   |-- OTB_query_test
   |-- OTB_query_train
   |-- OTB_videos

Thank you!

Testing with only the language description and image (no first-frame box)

Hello, I would like to ask whether it is possible to input only the language description, without the bounding box of the first frame, during the testing phase. For example, when evaluating, I would supply a video and its language description without providing any ground-truth values.
I tried all three choices of TEST_METHOD: "TRACK" # choice in ['GROUND', 'TRACK', 'JOINT']
But it didn't work; it still requires a ground-truth file:
Exception: Could not read file D:/JointNLT-main-2/output//OTB_videos/Threecar/groundtruth.txt
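One possible workaround, offered as a sketch rather than a supported feature of this repo: since the evaluator apparently reads groundtruth.txt only to initialize each sequence, placeholder files can be generated so the loader does not crash (any metrics computed against these placeholders are meaningless; the path below is the one from the error above).

import os

def write_dummy_groundtruth(dataset_root, box=(0, 0, 1, 1)):
    # Write a one-line groundtruth.txt (x,y,w,h) for every sequence directory
    # that lacks one, so the evaluation loader can start; the real first-frame
    # box is then expected to come from grounding, not from this file.
    # Note: some evaluators expect one line per frame; extend accordingly.
    for seq in os.listdir(dataset_root):
        seq_dir = os.path.join(dataset_root, seq)
        gt_path = os.path.join(seq_dir, "groundtruth.txt")
        if os.path.isdir(seq_dir) and not os.path.exists(gt_path):
            with open(gt_path, "w") as f:
                f.write(",".join(str(v) for v in box) + "\n")

write_dummy_groundtruth("D:/JointNLT-main-2/output/OTB_videos")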

Transform: conflict between code and comment

Is there a conflict between the code and the comment? Which should actually be used for training?

# Can't use RandomHorizontalFlip: it would change the picture so that it no longer fits the language description.
transform_joint = tfm.Transform(tfm.ToGrayscale(probability=0.05), tfm.RandomHorizontalFlip(probability=0.5))
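For context, the comment warns that mirroring invalidates descriptions that mention left or right (e.g. "the car on the left of the road"); a version consistent with the comment would simply drop the flip. A sketch assuming the same tfm API:

transform_joint = tfm.Transform(tfm.ToGrayscale(probability=0.05))  # no RandomHorizontalFlip, so left/right phrases in the text stay valid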

Sudden performance degradation during the training stage

Hello, thanks for your great work!
I followed the steps provided to train the model from scratch (swin_b_ep300.yaml, with 4 3090 GPUs), but I found that the model's performance suddenly dropped after 70 epochs of training (as shown in the figure below). Before the drop, the model's IoU (sum) reached 2.4 on the training set.

I also tried continuing training from the provided model (JointNLT_ep0300.pth.tar) and found that its IoU (sum) reaches around 2.7 on the training set.

So the model was not yet fully trained at 70 epochs, and I am curious why the performance dropped so suddenly. Are there any details I should pay attention to? (I also noticed that the provided model (JointNLT_ep0300.pth.tar) is named TransNLT internally, while the model in the open-source code is called JointNLT.)

Looking forward to your reply. Thanks again for your great work.
[Figure: training-set IoU curve showing the sudden drop around epoch 70]

Predictions when there is no object

Hello,
I am applying the trained weights in 'NL mode' on different datasets. The model still predicts bounding boxes when there is no object; what changes do you suggest for such cases, or for when the model is not confident?

Thanks in advance
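A common remedy in this situation, sketched generically since this repo may not expose it directly: gate the output on a confidence score and report "no object" below a threshold. The "confidence" key below is hypothetical, and "target_bbox" follows the PyTracking-style convention this codebase resembles.

CONF_THRESH = 0.5  # tune on a validation set

def filter_prediction(out_dict, thresh=CONF_THRESH):
    # Suppress low-confidence predictions; assumes the tracker's per-frame
    # output dict carries a confidence score (hypothetical key).
    if out_dict.get("confidence", 1.0) < thresh:
        return None  # treat as "no object in this frame"
    return out_dict["target_bbox"]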

How can I train the network on my custom dataset?

Hello, thank you very much for your work!
I have created my own dataset following the OTB dataset format, but I have not found a way to train on it with your network.
The following command line is provided in the readme file for training:
python tracking/train.py --script jointnlt --config swin_b_ep300 --save_dir log/swin_ep300 --mode multiple --nproc_per_node 4
But it does not indicate which datasets are used or where the training data is located.
How can I train the network on my custom dataset?
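For reference, in the STARK-style codebases that this repo resembles, training datasets are usually selected by name in the experiment YAML (here presumably experiments/jointnlt/swin_b_ep300.yaml) with their roots set in lib/train/admin/local.py; both locations are assumptions, not verified against this repo. A hypothetical sketch of the pieces to touch:

# lib/train/admin/local.py (assumed layout): register the dataset root.
class EnvironmentSettings:
    def __init__(self):
        self.workspace_dir = "/path/to/workspace"
        self.my_otb_style_dir = "/path/to/my_custom_dataset"  # hypothetical attribute

# Then add a loader class mirroring the existing OTB-style dataset, register
# its name with the dataset builder, and reference it in the YAML, e.g.:
#   DATA:
#     TRAIN:
#       DATASETS_NAME: ["MY_OTB_STYLE"]   # hypothetical name
#       DATASETS_RATIO: [1]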

Environment issues

I get the following error when running inference on the OTB sequences dataset.

Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.
Traceback (most recent call last):
  File "tracking/../lib/test/evaluation/running.py", line 138, in run_sequence
    output = tracker.run_sequence(seq, debug=debug)
  File "tracking/../lib/test/evaluation/tracker.py", line 88, in run_sequence
    output = self._track_sequence(tracker, seq, init_info)
  File "tracking/../lib/test/evaluation/tracker.py", line 143, in _track_sequence
    out = tracker.track(image, info)
  File "tracking/../lib/test/tracker/jointnlt.py", line 130, in track
    out_dict = self.network.forward_test(self.text_dict, self.template_dict, search_patch, temporal)
  File "tracking/../lib/models/JointNLT.py", line 196, in forward_test
    return self.forward_joint(text_src, text_mask, template_src, template_mask, search_src, search_mask, temporal)
  File "tracking/../lib/models/JointNLT.py", line 168, in forward_joint
    roi_feature = self.get_target_feature(search_tokens, pred_boxes)
  File "tracking/../lib/models/JointNLT.py", line 210, in get_target_feature
    target_roi_feature = self.roi(opt_feat, boxes)
  File "/anaconda/envs/joint/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda/envs/joint/lib/python3.7/site-packages/torchvision/ops/roi_align.py", line 86, in forward
    return roi_align(input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned)
  File "/anaconda/envs/joint/lib/python3.7/site-packages/torchvision/ops/roi_align.py", line 55, in roi_align
    _assert_has_ops()
  File "/anaconda/envs/joint/lib/python3.7/site-packages/torchvision/extension.py", line 34, in _assert_has_ops
    "Couldn't load custom C++ ops. This can happen if your PyTorch and "
RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.

I prepared a conda env from joint.yml.
It had two issues, so I had to change these package versions manually:

external=2.0 (this version does not exist)
clip=1.0
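The error itself points to a PyTorch/torchvision mismatch rather than to this repo; a quick check using only standard PyTorch:

import torch
import torchvision

# The installed pair must appear together in the compatibility matrix at
# https://github.com/pytorch/vision#installation; if it does not, reinstall
# torchvision to match the installed torch.
print(torch.__version__, torchvision.__version__)
print(torch.version.cuda)  # CUDA version the torch wheel was built against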

Is the template not used in the training code?

I looked through the training code and want to confirm: are the settings related to the template variable actually unused?
I am also confused by this code paragraph:

jump_flag = False
try:
    self.processing(data, data['grounding_frames_path'], ground_dict['pred_boxes'], image_coords)
except ValueError:
    jump_flag = True

There is no return inside the try block, so what is the purpose of running the code within the try context?
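For readers puzzled by the same pattern: a call inside try can exist purely for its side effects or as validation, with the exception acting as the signal. A minimal standalone illustration, with a hypothetical validate() standing in for self.processing():

def validate(box):
    # Stand-in for self.processing(): raises ValueError on bad input and
    # otherwise mutates internal state; its return value is never used.
    x, y, w, h = box
    if w <= 0 or h <= 0:
        raise ValueError("degenerate box")

jump_flag = False
try:
    validate((10, 20, 0, 5))  # executed for validation only
except ValueError:
    jump_flag = True          # mark this sample to be skipped

print(jump_flag)  # True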

After running any command, local.py is re-initialized

Hello, thank you very much for your work.
Before running, I entered all the required paths in the following file: lib/test/evaluation/local.py.
I then encountered a problem while evaluating. Whenever I run the following command:
python tracking/test.py jointnlt swin_b_ep300 --dataset otb --threads 16 --num_gpus 4 --params__model JointNLT_ep0300.pth.tar

all the contents of lib/test/evaluation/local.py are reset to their defaults, so the following error is reported:
RuntimeError: YOU HAVE NOT SETUP YOUR local.py!!!
Go to "tracking..\lib\test\evaluation\local.py" and set all the paths you need. Then try to run again.

I have tried many times and I do not know how to proceed. Thank you for your help.
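A quick way to see which local.py the code actually loads, assuming the PyTracking-style environment module this repo appears to inherit (the import path and attribute name are assumptions):

# Run from the repository root so relative imports resolve.
from lib.test.evaluation.environment import env_settings  # assumed module path

settings = env_settings()  # raises the "YOU HAVE NOT SETUP" error if paths are empty
print(settings.otb_path)   # assumed attribute for the OTB dataset root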

Training Time Confirmation

I saw another issue that mentioned this, and it is closed. But I am curious whether the whole training time can really be so short on 4 x 3090 GPUs, because I find the model has about 200M parameters, which is not small, larger even than some base models (around 90-100M parameters). In my experience, 300 epochs cannot be finished within 3.5 days.
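For reference, the ~200M figure quoted above can be checked with a standard PyTorch idiom (model stands for whatever network object the training script builds):

import torch

def count_parameters(model: torch.nn.Module) -> int:
    # Total parameter count, trainable or not.
    return sum(p.numel() for p in model.parameters())

# e.g. print(count_parameters(model) / 1e6, "M parameters")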

Network is not of correct type

When I try to load the JointNLT_ep0300.pth.tar checkpoint during training, checkpoint_dict['net_type'] is transnlt, which does not match the net type jointnlt, so the error "Network is not of correct type." is raised. I could not find any transnlt-related yaml file in the code. How should I resolve this?
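One workaround used in similar codebases, sketched here with the 'net' key assumed from the checkpoint_dict['net_type'] field quoted above: bypass the type check and load only the weights.

import torch

def load_weights_only(net, ckpt_path="JointNLT_ep0300.pth.tar"):
    # Load the state dict directly instead of going through the checkpoint
    # loader that enforces net_type; inspect ckpt.keys() first to confirm
    # where the weights are stored ('net' is an assumption).
    ckpt = torch.load(ckpt_path, map_location="cpu")
    net.load_state_dict(ckpt["net"])
    return net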

OTB_NL model for evaluation

Thanks for your excellent work.
I noticed that you provided pre-trained models for evaluation: "JointNLT_OTB_NL.pth.tar" and "JointNLT_ep300.pth.tar". I believe "JointNLT_OTB_NL.pth.tar" is specifically for evaluating on OTB99 via natural language only. I would like to know how the training strategy for this model differs from that of "JointNLT_ep300.pth.tar".
I would greatly appreciate it if I could get a reply from you.

LaSOTText Dataset

I understand that LaSOTText might be the extension subset of the LaSOT dataset. If that is the case, I have downloaded the latest extension subset from the official source, i.e. https://1drv.ms/u/s!Akt_zO4y_u6DgoQrvo5h48AC15l67A?e=Zo6PWx, but I was unable to find the nlp.txt file within the provided files.
Your guidance in this matter would be greatly appreciated, as it would help me utilize the dataset more effectively in my research.

Some training questions about the grounding and tracking sub-tasks

Hello, would it be convenient for you to provide the log file from training the model? Or could you tell me: after how many epochs does the grounding training become stable, i.e., the IoU between the grounding prediction box and the GT exceeds the threshold (0.5)? Also, at which epoch can tracking start to be trained normally, i.e., when does the tracking loss no longer need to be set to 0?

I would appreciate a reply when you have time. Many thanks!

Training Details

Hello, could you tell me the training details (GPUs, batch size, GPU memory consumption, training time)?

Raw results issues

Hi, thanks for your work.

Since the raw results on the datasets are available on your homepage, I downloaded them for research.
But I found an issue in the raw results that are initialized by NL:
the first-frame result of each sequence is identical to the first frame of the ground truth. That differs from how the first-frame result is obtained according to your paper, where it comes from grounding.

I hope to receive some answers about this. Thank you! @lizhou-cs
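The observation above is easy to reproduce with a small script, assuming both files store one comma- or tab-separated x,y,w,h box per line (the file paths are illustrative):

def first_box(path):
    # Read the first line of a results/groundtruth file as four floats.
    with open(path) as f:
        line = f.readline().replace(",", " ").replace("\t", " ")
    return [float(v) for v in line.split()[:4]]

pred = first_box("raw_results/OTB_videos/Threecar.txt")  # illustrative path
gt = first_box("OTB_videos/Threecar/groundtruth.txt")    # illustrative path
print("first frame identical to GT:", pred == gt)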
