microsoft / 2D-TAN
VideoX: a collection of video cross-modal models
License: Other
Thanks for the nice work, @penghouwen, @Sy-Zhang.
In Table I, some methods do not use VGG features in their papers. Did you replace their original features with VGG features, or just copy the results from their papers?
Another question: did you use the fc6 layer of VGG16?
Thank you again for your wonderful work!
Hi, on Google Drive there are only extracted features for the TACoS and Charades-STA datasets. Do you have the extracted features for the ActivityNet dataset? Thanks.
Hi,
I've implemented X-CLIP as a fork of 🤗 HuggingFace Transformers, and we are planning to add it to the library soon (see huggingface/transformers#18852). Here's a notebook that illustrates inference with it: https://colab.research.google.com/drive/1upFMg-FPNP_D8dxeYWTju6lpYldZk8AJ?usp=sharing
I really like the simplicity of X-CLIP, which is the main reason I decided to add it :)
As you may or may not know, each model on the HuggingFace hub has its own git repository. For example, the xclip-base-patch32 checkpoint can be found here. If you check the "files and versions" tab, you can find the converted weights of the model. The model hub uses git-LFS (large file storage) to use Git with large files such as model weights. This means that any model has its own Git commit history!
A model card can also be added to the repo, which is just a README.
If you haven't done so, would you be interested in joining the Microsoft organisation on the hub, such that we can store all model checkpoints there (rather than under my username)? This also enables you (and your co-authors) to have write access to the X-CLIP models on the hub, so you can edit the model cards, add new models etc.
Let me know!
Kind regards,
Niels
ML Engineer @ HuggingFace
Hello, thank you for your great work, but I have some questions.
1. In engine.py, I see some states ("on_start_epoch", "on_sample", "on_end_epoch", "on_test_sample"), but they only appear once, and I can't figure out their function. Can you explain?
2. I see you set MAX_EPOCH to 100, but I find the performance on the test set has not improved noticeably after around 20 epochs; more epochs only improve performance on the training set. Did you observe the same situation during your training?
As described in the paper: "Specifically, videos are decoded at 25 fps and the output of the last average pooling layer are extracted for every 16 consecutive frames. Therefore, each video clip corresponds to 0.64 second." Take TACoS for example: the fps is 29.4 in train.json, so I am confused about how to decode a video at 25 fps. Did you discard some frames? If we decode a video at its original fps, we get a time unit of 16/29.4 ≈ 0.54 seconds instead. Looking forward to your reply, thanks!
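One way to approximate the 25 fps decoding the paper describes, assuming frames are resampled by nearest-neighbor index mapping (the exact preprocessing is not shown in this thread, so this is a hedged sketch):

```python
def resample_frame_indices(num_frames, src_fps, dst_fps=25.0):
    """Map a video decoded at src_fps onto a dst_fps timeline by
    picking, for each target frame slot, the nearest source frame."""
    duration = num_frames / src_fps           # total length in seconds
    num_target = int(duration * dst_fps)      # frame count at dst_fps
    return [min(int(round(t / dst_fps * src_fps)), num_frames - 1)
            for t in range(num_target)]

# Example: a 10 s TACoS video at 29.4 fps becomes 250 frames at 25 fps;
# every 16 resampled frames then spans exactly 16/25 = 0.64 s.
idx = resample_frame_indices(num_frames=294, src_fps=29.4)
print(len(idx))  # 250
```

Under this reading, no frames are "discarded" outright; some source frames are skipped and others repeated so that the 0.64 s clip unit holds regardless of the original fps.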
Hi, I am interested in your multi-scale 2D-TAN model because your result on TACoS improved significantly, and the paper links to this code. But the two versions seem to share the same code, and when I reproduced it I got worse performance. I wonder if you have updated this code for your MS-2D-TAN model.
Hi, Songyang, when I want to run multi-scale 2D-TAN, the command is as follows: python moment_localization/run.py --cfg experiments/charades/MS-2D-TAN-G-VGG.yaml --verbose
but an error occurs: AttributeError: 'EasyDict' object has no attribute 'TAG'
What's wrong?
When using h5py.File("tall_c3d_features.hdf5", "r") to read the extracted features for the TACoS dataset, I got the following error:
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 72876032, sblock->base_addr = 0, stored_eof = 485560344)
while I can open other HDF5 files normally, e.g. PCA_activitynet_v1-3.hdf5 reads correctly.
Is the TACoS feature file corrupted? If so, could you please re-upload tall_c3d_features.hdf5?
Thanks a lot!
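For what it's worth, the error itself indicates a truncated download rather than a corrupted upload: h5py reports the actual end-of-file on disk (eof) against the size recorded in the HDF5 superblock (stored_eof). A quick pure-Python check of how much of the file survived, using the numbers copied from the error above:

```python
def download_completeness(actual_eof, stored_eof):
    """Fraction of the HDF5 file actually present on disk,
    based on the eof / stored_eof values from h5py's error."""
    return actual_eof / stored_eof

frac = download_completeness(actual_eof=72876032, stored_eof=485560344)
print(f"only {frac:.1%} of the file was downloaded")  # only 15.0% ...
```

If the fraction is well below 100%, re-downloading the file (ideally with a resumable downloader) is usually the fix.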
Hi @penghouwen, @Sy-Zhang, thanks for sharing the nice work!
I have a question: should the txt files be ordered in convert_vgg_features_to_hdf5.py? If these texts are not ordered, we cannot get the correct video temporal information.
Please correct me if I am wrong. Thank you again for your nice work!
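Ordering should indeed matter if the per-segment txt files are named with frame or clip indices, because plain lexicographic order puts "frame_10" before "frame_2". A hedged sketch of a natural sort (the filename pattern here is my assumption, not taken from convert_vgg_features_to_hdf5.py):

```python
import re

def natural_key(name):
    """Split a filename into text and integer chunks so numeric parts
    compare numerically: 'frame_2' sorts before 'frame_10'."""
    return [int(p) if p.isdigit() else p for p in re.split(r'(\d+)', name)]

files = ['frame_10.txt', 'frame_2.txt', 'frame_1.txt']
print(sorted(files, key=natural_key))
# ['frame_1.txt', 'frame_2.txt', 'frame_10.txt']
```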
Hi, thank you for your wonderful work. I was trying to download all the requirements to reproduce the results. I have the following trivial queries, please:
1. Where and how is decord used in this code base?
2. If possible, can you elaborate on the arrangement of the dataset? What I understood for option 2 is that we need to download the zipped data.
3. Where in the zipped files are the train and test label files used?
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "main.py", line 369, in
main(config)
File "main.py", line 121, in main
train_one_epoch(epoch, model, criterion, optimizer, lr_scheduler, train_loader, text_labels, config, mixup_fn)
File "main.py", line 203, in train_one_epoch
scaled_loss.backward()
File "/root/miniconda3/envs/env/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/handle.py", line 123, in scale_loss
File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/scaler.py", line 184, in unscale_with_stashed
File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/scaler.py", line 148, in unscale_with_stashed_python
File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/scaler.py", line 22, in axpby_check_overflow_python
File "/root/miniconda3/envs/env/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 5423) is killed by signal: Segmentation fault.
Killing subprocess 4388
Hi there, I got the dataloader error above during some iterations of an epoch. Do you have any idea about that?
These are part of parameters:
python -m torch.distributed.launch --nproc_per_node=1 main.py -cfg configs/k600/16_8.yaml --output . --accumulation-steps 2 --resume /data/xxxx/xclip/VideoX-master/X-CLIP/pretrained_models/k600_16_8.pth
batch size is 8
Hi
I want to ask: for ActivityNet, have you reported the results on the val or test set (Table 2)?
Thanks!!!
Is this link expired? Thanks! http://ai2-website.s3.amazonaws.com/data/Charades_v1_features_rgb.tar.gz
Hi @Sy-Zhang!
I appreciate you providing a nicely organized codebase.
I am confused about the data split regarding the TACoS dataset.
While your paper indicates that it follows the data split of TALL (Gao et al. 2017), I found they are not the same.
The data split in TALL is 50:25:25 (proportions) while your code uses 75/27/25 (actual numbers), which is obviously different.
It would be clearer if you stated which one practitioners should follow.
Many thanks.
Hi, thanks for your great paper.
In Fig. 2 of the paper, it looks like the "Video-specific Prompting" uses the output of the "Multi-frame Integration Transformer" as its visual feature input.
But in the implementation code, you feed the output "img_features" of the "Cross-frame Communication Transformer" into "Video-specific Prompting".
Is the figure in the paper wrong?
Hello, thanks for sharing!
I run your code and always hit the following problem (for all three datasets):
File "moment_localization/train.py", line 295, in
scheduler=scheduler)
File "/home/zq/reproduce/2D-TAN/moment_localization/../lib/core/engine.py", line 43, in train
self.hook('on_update', state)
File "/home/zq/reproduce/2D-TAN/moment_localization/../lib/core/engine.py", line 8, in hook
self.hooks[name](state)
File "moment_localization/train.py", line 203, in on_update
val_state = engine.test(network, iterator('val'), 'val')
File "/home/zq/reproduce/2D-TAN/moment_localization/../lib/core/engine.py", line 60, in test
for sample in state['iterator']:
File "/home/zq/anaconda3/envs/2dtan/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 576, in __next__
idx, batch = self._get_batch()
File "/home/zq/anaconda3/envs/2dtan/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
success, data = self._try_get_batch()
File "/home/zq/anaconda3/envs/2dtan/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 33782) exited unexpectedly
my env:
Ubuntu 16.04.6
cuda 9.0
python 3.7.5
torch 1.1.0
looking forward to your reply
Thank you for your solid work.
My GPU memory slowly increases while the model is training. I wonder if there is a memory leak in the code?
Hello, have you finished updating the code? I get an error when testing:
Traceback (most recent call last):
File "moment_localization/test.py", line 20, in
from core.utils import AverageMeter
File "/data/gao/2D-TAN/moment_localization/../lib/core/utils.py", line 8, in
from pathlib import Path
ImportError: No module named pathlib
What is this error? Could you tell me? Thank you.
According to the code, ActivityNet-v1.3 (released in 2016) is used, while the paper says ActivityNet Captions (released in 2017) is used. However, these two are not the same dataset, which really confuses me.
Could you please help me with it? Thank you very much.
Hi @Sy-Zhang
Thank you for your excellent work!
I am confused about the annotations in TACoS, because some of the annotations you provide differ from the officially provided ones.
For example, for the video s13-d21.avi in train.json, the timestamp [252, 686] appears in your annotations but not in the original annotations. I'd appreciate it if you could explain that.
Hi, thanks for your great paper.
In Table 3 of this paper, the zero-shot performance of ActionCLIP is 40.8% and 58.3%, but according to Figure 3 of the ActionCLIP, the zero-shot performance of these two datasets is about 50% and 70%.
Is there any difference in implementation?
Hi!
In your paper released on arXiv, it says that "we sequentially feed the word embeddings into a three-layer bidirectional LSTM network."
However, in the experiment setting files under the experiments folder, TAN/FUSION_MODULE/PARAMS/LSTM/BIDIRECTIONAL is always set to False.
May I ask the reason for this?
Hi, may I ask about the feature extraction method?
The paper says the method is VGG16, but in the code the features are already provided with the dataset; take Charades as an example, it uses Charades_v1_features_rgb.tar.gz.
For my own dataset, do I need to first train a two-stream network (like the Two-Stream features, RGB stream) to extract features from the data, and then use 2D-TAN to localize?
The provided link only contains processed features for the TACoS and Charades-STA datasets. For the ActivityNet dataset, did you use the C3D features officially provided by the ActivityNet challenge?
Hi,
Can you clarify more about the TARGET_STRIDE argument?
Thanks
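My understanding (an assumption from reading the configs, not an official answer) is that TARGET_STRIDE controls how the sampled input clips are average-pooled into the units of the 2D map, so NUM_SAMPLE_CLIPS / TARGET_STRIDE gives the length of each axis of the map. A minimal sketch:

```python
def pool_clips(features, target_stride):
    """Average consecutive clip features in groups of target_stride.
    features: list of per-clip feature vectors (lists of floats)."""
    pooled = []
    for i in range(0, len(features), target_stride):
        group = features[i:i + target_stride]
        # mean over the group, dimension by dimension
        pooled.append([sum(vals) / len(group) for vals in zip(*group)])
    return pooled

# e.g. 256 sampled clips with TARGET_STRIDE=4 -> a 64-unit axis for the 2D map
feats = [[float(i)] for i in range(256)]
print(len(pool_clips(feats, 4)))  # 64
```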
I met the following problem:
"RuntimeError: CUDA out of memory. Tried to allocate 1.41 GiB (GPU 0; 11.91 GiB total capacity; 11.04 GiB already allocated; 203.06 MiB free; 121.00 MiB cached)"
When I run the program as introduced, GPU 0 is available.
How can I solve this issue? Thank you.
Solved.
hi,
Thank you for your excellent work! I would like to ask: are the original frame images you used the 24 fps ones from the Charades official website (Charades dataset only)? Hoping for your reply! Thanks!
Best,
jun
Hi, as #9 said, I could download the extracted features of ActivityNet. But I want to know how to use the file PCA_activitynet_v1-3.hdf5,
because I want to use this dataset with Charades-STA's model. Thank you very much if you can help me.
Hello!
I am trying to train the 2D-TAN network with my own extracted features of TACoS and I got the following error:
`Traceback (most recent call last):
File "moment_localization/train.py", line 297, in
scheduler=scheduler)
File "/home/share/wangzilong2/2D-TAN/moment_localization/../lib/core/engine.py", line 41, in train
state['optimizer'].step(closure)
File "/home/share/wangzilong2/home/share/wangzilong2/anaconda3/envs/2D-TAN/lib/python3.7/site-packages/torch/optim/adam.py", line 58, in step
loss = closure()
File "/home/share/wangzilong2/2D-TAN/moment_localization/../lib/core/engine.py", line 30, in closure
loss, output = state['network'](state['sample'])
File "moment_localization/train.py", line 154, in network
prediction, map_mask = model(textual_input, textual_mask, visual_input)
File "/home/share/wangzilong2/home/share/wangzilong2/anaconda3/envs/2D-TAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/share/wangzilong2/2D-TAN/moment_localization/../lib/models/tan.py", line 20, in forward
vis_h = self.frame_layer(visual_input.transpose(1, 2))
File "/home/share/wangzilong2/home/share/wangzilong2/anaconda3/envs/2D-TAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/share/wangzilong2/2D-TAN/moment_localization/../lib/models/frame_modules/frame_pool.py", line 18, in forward
vis_h = torch.relu(self.vis_conv(visual_input))
File "/home/share/wangzilong2/home/share/wangzilong2/anaconda3/envs/2D-TAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/share/wangzilong2/home/share/wangzilong2/anaconda3/envs/2D-TAN/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 196, in forward
self.padding, self.dilation, self.groups)
RuntimeError: Expected 3-dimensional input for 3-dimensional weight 256 2048 1 94911601578208 94911602548768 94911602553032, but got 6-dimensional input of size [256, 2048, 8, 2, 7, 7] instead`
My features (64 × 2048 per video) partially resemble the originally provided TACoS C3D features (x × 4096 per video, where x is the varying number of clips used during extraction, I think; the point is it differs per video), and both are in HDF5 format. I have no idea what is wrong even after searching the web. Please help, thank you!
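The error suggests the features were saved with spatial (and possibly crop) dimensions still attached: the frame layer's conv1d expects 3-dimensional input (batch, channels, num_clips), but received [256, 2048, 8, 2, 7, 7]. A hedged numpy sketch of collapsing the trailing dimensions by mean pooling before the model sees the tensor (whether mean pooling matches how the reference features were produced is an assumption):

```python
import numpy as np

def to_clip_features(x):
    """Collapse everything after (batch, channels, clips) by mean pooling,
    turning e.g. (B, C, T, 2, 7, 7) into (B, C, T)."""
    b, c, t = x.shape[:3]
    return x.reshape(b, c, t, -1).mean(axis=-1)

# same layout as the [256, 2048, 8, 2, 7, 7] in the error, scaled down
x = np.zeros((4, 16, 8, 2, 7, 7), dtype=np.float32)
print(to_clip_features(x).shape)  # (4, 16, 8)
```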
Hello
Thanks for sharing this amazing work.
I have a question regarding Table 2. The original MCN paper didn't publish results on the ActivityNet Captions dataset, so I assume you re-evaluated their model on it. My question: did you follow their original setting of dividing each video into 5-second segments, so that each moment candidate is composed of any continuous run of segments?
Best
Thanks for such great work. I want to know how to generate proposals for temporal action localization via the sparse 2D temporal adjacent network. Is it done by setting different strides in the original conv2d? Thank you.
Hello, did you try to fix the random seed in the code to obtain the same results when running the same code? I made some attempts as follows:
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
I added the set_seed() function in train.py, but I cannot obtain the same results when I run the same code without any changes. Do you know what the problem is?
Hello
I am trying to reproduce the results mentioned in the paper for the activitynet c3d features.
but I am not getting the same results; the results I am getting are as follows:
https://imgur.com/a/AYrJ4hi
I didn't edit the experiments config file at all (by the way, is the 2D-TAN-64x64-K9L4-pool.yaml file supposed to give the results mentioned in Table 3?).
Any idea what might be wrong?
Also, I didn't find any references for the loss function defined in https://github.com/microsoft/2D-TAN/blob/master/lib/models/loss.py
so how exactly are you calculating the loss?
Hello, thanks for the wonderful work!
So I have some features that I extracted by myself and would like to train and test the network using those features, I wonder if you could let me know how to do this? I'm still using ActivityNet, Charades-STA and TACoS dataset but with different features. It would be even better if you could explain how to train the network on some completely different datasets!
Thanks in advance!
May I ask: is there any implementation of S-2D-TAN for HACS on GitHub? Thank you very much.
Hello, I read your paper recently and your work was so amazing that I really want to reproduce it. But I have trouble downloading the visual features from Google Drive because I don't have permission. I would like to request access to the visual features.
Hi, in the paper you provide the upper-bound results on the ActivityNet Captions dataset. I want to know how to calculate the upper-bound results, thanks!
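A common way to compute such an upper bound (this is my guess at the procedure, not confirmed by the authors) is an oracle evaluation: for each query, take the best IoU achievable by any candidate moment in the 2D map, and count the query as correct if that best IoU clears the threshold. A sketch:

```python
def temporal_iou(a, b):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def oracle_recall(gt_moments, candidates, thresh=0.5):
    """Fraction of ground-truth moments for which SOME candidate reaches
    IoU >= thresh, i.e. the best any ranking of candidates could do."""
    hits = sum(
        max(temporal_iou(gt, c) for c in candidates) >= thresh
        for gt in gt_moments
    )
    return hits / len(gt_moments)

# Toy candidate set: all (start, end) spans on a 4-unit grid of 1 s each
cands = [(s, e) for s in range(4) for e in range(s + 1, 5)]
print(oracle_recall([(0.2, 2.1), (3.4, 3.6)], cands))  # 0.5
```

The short second moment (0.2 s) cannot be covered at IoU 0.5 by any 1 s-grid candidate, which is exactly the kind of limitation an upper-bound row in the paper would quantify.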
Thank you for kindly sharing! I am curious how much computational resource is needed for training the model, and what the corresponding training time is, since the model has a relatively large number of parameters as introduced in the paper.
What is your model for extracting visual features?
Hi
can you explain the intuition behind the need of scaling the IoU values between 0 and 1 in the loss function?
Thanks
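From reading the loss code, each candidate's supervision target appears to be its IoU with the ground truth, linearly rescaled between two thresholds and clipped to [0, 1], so candidates below MIN_IOU are pushed to 0 and those above MAX_IOU to 1. This is my reading of the code, sketched (the default thresholds are an assumption):

```python
def scaled_iou_target(iou, min_iou=0.5, max_iou=1.0):
    """Linearly rescale an IoU into [0, 1] between min_iou and max_iou,
    clipping outside that range."""
    scaled = (iou - min_iou) / (max_iou - min_iou)
    return min(max(scaled, 0.0), 1.0)

print([scaled_iou_target(i) for i in (0.3, 0.5, 0.75, 1.0)])
# [0.0, 0.0, 0.5, 1.0]
```

The intuition would be that raw IoU makes a poor binary-cross-entropy target: mediocre overlaps (say 0.3) would still receive positive supervision, while rescaling makes only genuinely good candidates count as positives and grades them by quality.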
Hi,
I would like to ask what the relation is between your proposed cross-frame attention and the ones in IFC [1] and TEViT [2]; I believe neither of these papers is cited. In addition, the text token resembles the one in ReferFormer (CVPR 22).
As the cross-frame communication transformer is considered a major contribution of the paper, I need to raise an AIV concern.
[1] Video Instance Segmentation using Inter-Frame Communication Transformers
[2] Temporally Efficient Vision Transformer for Video Instance Segmentation
I ran the code on the ActivityNet Captions dataset and it takes 3 hours for one epoch on one RTX 2080 Ti. I want to confirm the expected time and number of epochs.
Greetings!
I'm currently trying to run and train this model, but I've encountered some problems relating to the dataset.
I've followed the instruction to download the Charades_v1_features_rgb.tar.gz from the official website and converted it to charades_vgg_rgb.hdf5. However, the error occurred saying that I don't have the file vgg_rgb_features.hdf5.
I wonder: do I have to download this from the Box drive link in the README.md? Since the download always seems to fail, I want to know whether it will work the same if I simply change the output file name to vgg_rgb_features.hdf5 instead of charades_vgg_rgb.hdf5 in convert_vgg_features_to_hdf5.py.
Hi, Sy. I am trying to run MS-2D-TAN with python moment_localization/train.py --cfg experiments/tacos/MS-2D-TAN-G-VGG.yaml --dataDir data/ --verbose
but get the following error:
Traceback (most recent call last):
File "moment_localization/train.py", line 77, in <module>
args = parse_args()
File "moment_localization/train.py", line 43, in parse_args
update_config(args.cfg)
File "/media/jpl/T7/MS-2D-TAN/lib/core/config.py", line 105, in update_config
_update_dict(config[k], v)
File "/media/jpl/T7/MS-2D-TAN/lib/core/config.py", line 96, in _update_dict
raise ValueError("{} not exist in config.py".format(k))
ValueError: DATA_DIR not exist in config.py
For clarification: since my environment reported an error on import, I changed the code in train.py
(but I don't think the problem is there):
import sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from lib import models
from lib import datasets
from lib.core.config import config, update_config
...
Sorry to disturb you. Really excellent work!
May I ask which conv models you used for extracting frame features from raw videos?
Hi, Thanks for sharing such good work.
I just found that the OneDrive link for feature download has expired. Could you repair it as soon as possible?
Regards