vidsgg-big's Introduction

Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

PyTorch implementation of our paper Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, accepted by CVPR 2022. (arXiv)

[Figure: cvpr2022-6703.png]

We also won 1st place in the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021 with a simplified version of our model. (The code for object tracklet generation is available here.)

Requirements

Python 3.7 or later and PyTorch 1.6 or later. For other basic packages, simply run the project and install whatever is reported as missing.

Datasets

Download the ImageNet-VidVRD dataset and the VidOR dataset, and organize them in the following directory structure:

├── dataloaders
│   ├── dataloader_vidvrd.py
│   └── ...
├── datasets
│   ├── cache                       # cache file for our dataloaders
│   ├── vidvrd-dataset
│   │   ├── train
│   │   ├── test
│   │   └── videos
│   ├── vidor-dataset
│   │   ├── annotation
│   │   └── videos
│   └── GT_json_for_eval
│       ├── VidORval_gts.json       # GT JSON for evaluation
│       └── VidVRDtest_gts.json
├── tracking_results                # tracklets data & features
│   ├── ...
│   ├── format_demo.py              
│   └── readme.md   
├── experiments   
├── models
├── ...

Verify tracklet data & feature preparation by running dataloader_demo

This section helps you download the tracklet data, place it correctly, and set the dataloader's config. Run tools/dataloader_demo.py successfully to verify that all data & configs are set up correctly.

NOTE: we use the term proposal in our code to denote video-level tracklet proposals, which is totally different from the concept of "proposal" in the "proposal-based methods" discussed in our paper. In the paper, "proposal" refers to paired subject-object tracklet segments. In contrast, the term proposal in our code refers to long-term, video-level object tracklets (i.e., without sliding windows or video segments).

Tracklet data for VidVRD

  1. Download the tracklets with features here: train, test, and put them in tracking_results/. Refer to tracking_results/readme.md for more details about the tracklet data.

  2. Download the tracklets with features used in "Beyond Short-Term Snippet: Video Relation Detection with Spatio-Temporal Global Context" from the author's personal page here.

    Some Notes

    • We use the term pku (i.e., Peking University) in our code to refer to their tracklets & features.
    • The original data released by them only has 999 .npy files (they may have updated the link since), missing the data for video ILSVRC2015_train_00884000. So we trained our own Faster R-CNN (with the same training setting as the above paper) and extracted the tracklets & features ourselves. The supplemental data can be found here.
  3. The tracklets with features are in VidVRD_test_every1frames (ours), VidVRD_train_every1frames (ours), and preprocess_data/tracking/videovrd_detect_tracking (PKU, both train & test), in which each .npy file corresponds to one video and contains all the tracklets in that video (see the sketch at the end of this list). The I3D features of the tracklets are in preprocess_data/tracking/videovrd_i3d (PKU, both train & test). Put them under the project directory (or anywhere else if you use absolute paths).

  4. Modify the config file at experiments/demo/config_.py, where proposal_dir is the directory of the tracklets with features, i3d_dir is the directory of the tracklets' I3D features, and ann_dir is datasets/vidvrd-dataset.

  5. Verify that all data & configs are set correctly, e.g., for PKU's tracklets with I3D features, run the following command (refer to tools/dataloader_demo.py for more details):

    python tools/dataloader_demo.py \
            --cfg_path experiments/demo/config_.py \
            --split test \
            --dataset_class pku_i3d
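
The following is a rough sketch of how you might inspect one of these .npy files by hand before running the demo. It is illustrative only: it assumes the per-box record layout described in tracking_results/readme.md (and in the box-shifting note further down this page), i.e., each record is either a 6-dim vector or a (12 + feature-dim) vector starting with the frame id and tracklet id; the file path below is a placeholder.

    import numpy as np

    # Placeholder path; substitute any real file from VidVRD_test_every1frames/
    path = "VidVRD_test_every1frames/ILSVRC2015_test_00000000.npy"

    track_res = np.load(path, allow_pickle=True)  # all tracklet box records of one video
    print("number of box records:", len(track_res))

    for box_info in track_res[:5]:
        box_info = list(box_info)
        if len(box_info) == 6:
            # frame id, tracklet id, and a tracker-generated xywh box only
            frame_id, tid, x, y, w, h = box_info
            print(f"frame {int(frame_id)}, tid {int(tid)}: tracker box only")
        else:
            # first 12 entries: frame id, tid, tracker xywh, confidence,
            # category id, detector xywh; the remainder is the appearance feature
            frame_id, tid = box_info[0], box_info[1]
            print(f"frame {int(frame_id)}, tid {int(tid)}: feature dim = {len(box_info) - 12}")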
    

Tracklet data for VidOR

  • Download the I3D features of the train & val videos (used for the grounding stage) here.

    • We extract the I3D features by following this repo (and also use their released pre-trained I3D weights).
  • Download the pre-prepared cache data for VidOR-val (here, around 19G) and VidOR-train (here, 14 parts in total, around 126G), and put them in datasets/cache. (These cached data include the classeme features.)

  • Some Notes

    Ideally, you could prepare these cache data from the .npy files (as is done for VidVRD). However, for legacy coding reasons, we extract a bbox RoI feature for each frame, which makes these .npy files too large (827G for VidOR-train and 90G for VidOR-val). Therefore, we only release the pre-prepared cache data above.

    Despite this, we still release the .npy files without RoI features, i.e., with box positions only (here, around 12G), and you can extract their RoI features yourself based on those positions. Refer to tracking_results/readme.md for more details about the tracklet data.

    Refer to the repository VidVRD-tracklets (last section of its README.md) for more details about extracting features from the given bbox positions.

    As for the classeme feature: for VidOR, we use a weighted average of the category word embeddings, weighted by the classification probability vector predicted by the detector ("soft" classeme); for VidVRD, we just use the category word embedding itself ("hard" classeme). Refer to tools_draft/extract_classeme.py for more details; a small sketch of both variants follows this note.
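
    Below is a minimal sketch of the "soft" vs. "hard" classeme idea described above. It is illustrative only (not the code in tools_draft/extract_classeme.py): the embedding matrix and the probability vector are random placeholders.

        import numpy as np

        num_classes, emb_dim = 80, 300                      # e.g., object categories, GloVe-300d
        word_emb = np.random.randn(num_classes, emb_dim)    # placeholder category word embeddings
        cls_probs = np.random.dirichlet(np.ones(num_classes), size=1)  # placeholder detector probabilities

        # "soft" classeme (VidOR): probability-weighted average of category embeddings
        soft_classeme = cls_probs @ word_emb                # shape (1, emb_dim)

        # "hard" classeme (VidVRD): the embedding of the single predicted category
        hard_classeme = word_emb[cls_probs.argmax(axis=1)]  # shape (1, emb_dim)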

Evaluation

First, make sure you can run tools/dataloader_demo.py successfully.

  1. First, generate the GT JSON files for evaluation:

    for vidvrd:

    python VidVRD-helper/prepare_gts_for_eval.py \
        --dataset_type vidvrd \
        --save_path datasets/GT_json_for_eval/VidVRDtest_gts.json
    

    for vidor:

    python VidVRD-helper/prepare_gts_for_eval.py \
        --dataset_type vidor \
        --save_path datasets/GT_json_for_eval/VidORval_gts.json
    
  2. Download model weights for different exps here, and put them in the experiments/ dir. Download pre-prepared data here, and put them in the prepared_data/ dir.

  3. Refer to experiments/readme.md for the correspondence between the exp ids and the table ids in our paper.

  4. For VidVRD, run the following commands to evaluate different exps: (refer to tools/eval_vidvrd.py for more details)

    e.g., for exp1

    python tools/eval_vidvrd.py \
        --cfg_path experiments/exp1/config_.py \
        --ckpt_path experiments/exp1/model_epoch_80.pth \
        --use_pku \
        --cuda 1 \
        --save_tag debug
    
  5. For VidOR, refer to tools/eval_vidor.py for more details.

    Run the following commands to evaluate BIG-C (i.e., only the classification stage):

    python tools/eval_vidor.py \
        --eval_cls_only \
        --cfg_path experiments/exp4/config_.py \
        --ckpt_path experiments/exp4/model_epoch_60.pth \
        --save_tag epoch60_debug \
        --cuda 1
    

    Run the following commands to evaluate BIG based on the output of the classification stage (you need to run BIG-C first and save the infer_results).

    python tools/eval_vidor.py \
        --cfg_path experiments/grounding_weights/config_.py \
        --ckpt_path experiments/grounding_weights/model_epoch_70.pth \
        --output_dir experiments/exp4_with_grounding \
        --cls_stage_result_path experiments/exp4/VidORval_infer_results_topk3_epoch60_debug.pkl \
        --save_tag with_grd_epoch70 \
        --cuda 1
    

    Run the following commands to evaluate the fraction recall (refer to Table 6 in our paper; you need to run BIG first and save the hit_infos).

    python tools/eval_fraction_recall.py \
        --cfg_path experiments/grounding_weights/config_.py \
        --hit_info_path  experiments/exp5_with_grounding/VidORval_hit_infos_aft_grd_with_grd_epoch70.pkl
    

NOTE

  • We also provide alternative evaluation scripts (i.e., tools/eval_vidvrd_our_gt.py and tools/eval_vidor_our_gt.py). The main difference lies in the process of constructing the GT tracklets (i.e., going from frame-level bbox annotations to video-level tracklet GTs). Compared to VidVRD-helper's GTs, here we perform linear interpolation for fragmented GT tracklets. Consequently, the evaluation results differ slightly.
  • Nevertheless, the results reported in our paper are evaluated with VidVRD-helper's GTs (i.e., tools/eval_vidvrd.py and tools/eval_vidor.py) to ensure fair comparisons.
  • In our paper, all scores are truncated to 4 decimal places (not rounded); see the snippet below.
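
For reference, the truncation (as opposed to rounding) can be reproduced with the following snippet; the score value is just an example.

    import math

    score = 0.123456789
    truncated = math.floor(score * 1e4) / 1e4   # 0.1234 (truncated, as reported)
    rounded = round(score, 4)                   # 0.1235 (rounded, for contrast)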

Training

  1. For VidVRD, run the following commands to train for different exps: (refer to tools/train_vidvrd.py for more details)

    e.g., for exp1

    CUDA_VISIBLE_DEVICES=0,1 python tools/train_vidvrd.py \
        --cfg_path experiments/exp1/config_.py \
        --use_pku \
        --save_tag retrain
    
  2. For VidOR, refer to tools/train_vidor.py for more details.

    Run the following commands to train BIG-C (i.e., only the classification stage). e.g., for exp4

    CUDA_VISIBLE_DEVICES=0,1 python tools/train_vidor.py \
        --cfg_path experiments/exp4/config_.py \
        --save_tag retrain
    

    Note that we pre-assign all the labels for Base-C (exp6), since it does not require bipartite matching between predicates and GTs. The label assignment takes around 1.5 hours.

    Run the following commands to train the grounding stage:

    CUDA_VISIBLE_DEVICES=2,3 python tools/train_vidor.py \
        --train_grounding \
        --cfg_path experiments/grounding_weights/config_.py \
        --save_tag retrain
    

Data Release Summary

  • model weights for all exps (google drive, here)

  • pre-prepared data (statistical prior & category text GloVe embeddings) (dcd MEGA cloud, here)

  • I3D feature of VidOR train & val around 3.3G (dcd MEGA cloud, here)

  • VidOR traj .npy files (OnlyPos) (already released, around 12G) here (gkf google drive). Refer to this repository: VidVRD-tracklets

  • VidVRD traj .npy files (with feature) around 20G

    • VidVRD-test , gkf zju cloud, here (3.87G) (deprecated)
    • VidVRD-test, MEGA cloud, here (1.42G) (same file as that in zju cloud, but is zipped).
    • VidVRD-train dcd MEGA cloud here (13G)
  • cache file for train & val (for vidor)

    • v9 for val (around 19G) (dcd MEGA cloud here)
    • v7 for train (14 parts, around 126G in total here)
  • [Update]: Because the link to the PKU data provided by its authors is no longer available, we have uploaded their data to our MEGA cloud: here (6.21G)

  • [Update]: For the Tracklet detector's weight:

    • for VidOR dataset, it has been released in our previous repo: VidVRD-tracklets, and can be downloaded here
    • for the VidVRD dataset, download here (NOTE that this weight is a re-trained version, not the same as the one used in our paper)
    • for the training data of the Tracklet detector, refer to the Appendix of our paper

(dcd is just the name of our lab's MEGA cloud account :) )

Citation

If our work is helpful for your research, please cite our publication:

@inproceedings{gao2021classification,
  title={Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs},
  author={Gao, Kaifeng and Chen, Long and Niu, Yulei and Shao, Jian and Xiao, Jun},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  year={2022}
}

Others

I have been working on this project for more than a year. While learning and using PyTorch along the way, I wrote a lot of utility functions (utils/utils_func.py), some of which might be interesting and useful, e.g., unique_with_idx_nd (a rough sketch of its behavior is given below). So I opened a new repo to collect them, here.
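
As an illustration only (not the actual implementation in utils/utils_func.py), a helper like unique_with_idx_nd could return the unique rows of a tensor together with, for each unique row, the indices of its occurrences:

    import torch

    def unique_with_idx_nd_sketch(x: torch.Tensor):
        """Sketch: unique rows of a 2-D tensor plus, per unique row,
        the indices where that row occurs in the input."""
        uniq, inverse = torch.unique(x, dim=0, return_inverse=True)
        index_groups = [torch.nonzero(inverse == i, as_tuple=False).squeeze(1)
                        for i in range(uniq.shape[0])]
        return uniq, index_groups

    x = torch.tensor([[1, 2], [3, 4], [1, 2]])
    uniq, groups = unique_with_idx_nd_sketch(x)
    # uniq -> [[1, 2], [3, 4]]; groups -> [tensor([0, 2]), tensor([1])]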


vidsgg-big's Issues

tracklets with features link expired

Hello! I tried to download the tracklets with features from the author's website http://www.muyadong.com/publication.html, under the paper "Beyond Short-Term Snippet: Video Relation Detection with Spatio-Temporal Global Context", but the link has expired.
Is there any other way to get the data? I have read that you trained your own RCNN to get the tracklets with features for one file that was missing. Should I do the same? Or maybe you have the downloaded features saved somewhere?

Make the qualitative results

Thank you for sharing the source code!
I would like to know how we can reproduce the experiments shown in Figure 6 of the paper.
Also, for the VidVRD dataset, can we obtain the qualitative results as visualizations?

No such file or directory

Hello, where can I get the files prepared_data/vidvrd_EntiNameEmb_pku.npy and prepared_data/pred_bias_matrix_vidvrd_pku.npy? I found no way to generate them by myself. Thank you very much!

About inference results

Hi, when I evaluate the grounding stage on VidOR by running eval_vidor.py, I find that an inference result is loaded before evaluation, i.e., VidORval_infer_results_topk3_epoch60_debug.pkl. Could you describe the structure of the data in this file?
Thank you very much!

About the pre-prepared cache data for VidOR

Hi, could you introduce the structure of the pre-prepared cache data for VidOR, for example, MEGAv7_VidORtrain_freq1_part01_th_15-180-200-0.40.pkl?

Besides, when training the grounding stage, I have a question about the DEBUG code.
In models/grd_model_v5.py, line 268, does inter_dura mean the time span during which both the subject and the object appear in the trajectory? And what is the meaning of index_map?

Thank you very much!

Pre-prepared cache data for VidOR

Hello, could you provide the pre-prepared cache data of the VidOR dataset via Baidu Cloud or some other download channel? The files are too large and the download from MEGA cannot be completed. Thank you.

Device information

When I loaded all the cached VidOR data and the dataloader started to fork, the program crashed with OSError: [Errno 12] Cannot allocate memory. I tried a smaller number of workers, but it did not help.

I am wondering whether my memory is too small (189 GB in total; about 150+ GB is occupied before the crash). Could you please provide your device information?

Thank you!

Box shifting: some boxes may appear as background after tracking (when using dataloader_vidor.py)

Tips from @Dawn-LX :

This problem originates from the following code:

for idx, box_info in enumerate(track_res):
    if not isinstance(box_info, list):
        box_info = box_info.tolist()
    assert len(box_info) == 6 or len(box_info) == 12 + self.dim_boxfeature, "len(box_info)=={}".format(len(box_info))
    frame_id = box_info[0]
    tid = box_info[1]
    tracklet_xywh = box_info[2:6]
    xmin_t, ymin_t, w_t, h_t = tracklet_xywh
    xmax_t = xmin_t + w_t
    ymax_t = ymin_t + h_t
    bbox_t = [xmin_t, ymin_t, xmax_t, ymax_t]
    confidence = float(0)
    if len(box_info) == 12 + self.dim_boxfeature:
        confidence = box_info[6]
        cat_id = box_info[7]
        xywh = box_info[8:12]
        xmin, ymin, w, h = xywh
        xmax = xmin + w
        ymax = ymin + h
        # final box = average of the tracker box and the detector box
        bbox = [(xmin + xmin_t) / 2, (ymin + ymin_t) / 2, (xmax + xmax_t) / 2, (ymax + ymax_t) / 2]

Here, we notice that the tracking result for each box at a specific frame is either a 6-dim vector or a (12 + dim_boxfeature)-dim vector.

  1. If the 6-dim vector appears, the corresponding box is treated as background.
  2. Otherwise, the first 12 dims of box_info, which consist of frame_id, tracklet_id, a 4-dim bbox, confidence, category_id, and another 4-dim bbox, are used to determine the final location of the bbox.

The first 4-dim bbox (box_info[2:6]) is generated by the tracker, and the second one (box_info[8:12]) is generated by our video object detector. Boxes shift because we average these two sets of coordinates: the detected object location may be inconsistent with the current tracklet, while the tracker-generated box is more precise, so this averaging can merge the two boxes into a background box.
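
As a concrete, made-up numeric illustration: if the detector box is wrongly linked to a tracklet and lies far from the tracker box, their average covers neither object.

    # Illustrative numbers only.
    tracker_box  = [100.0, 100.0, 150.0, 150.0]   # xmin, ymin, xmax, ymax from the tracker
    detector_box = [300.0, 300.0, 350.0, 350.0]   # detector box wrongly linked to this tracklet

    averaged = [(t + d) / 2 for t, d in zip(tracker_box, detector_box)]
    print(averaged)   # [200.0, 200.0, 250.0, 250.0] -> lands on background between the two objects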

Specifically, the box generated by the tracker is much more precise, since it considers boxes in previous frames, the currently detected box, and visual similarity. But the box from the video object detector may be wrongly linked to the current tracklet (which does not mean it is a background box itself). So this averaging is not strictly correct in these cases, which is why we use only the tracker-generated box (box_info[2:6]) in the snippet below:

tracklet_xywh = box_info[2:6]
xmin_t,ymin_t,w_t,h_t = tracklet_xywh
xmax_t = xmin_t + w_t
ymax_t = ymin_t + h_t
confidence = box_info[6]
bbox_t = [xmin_t,ymin_t,xmax_t,ymax_t,confidence]
cat_id = box_info[7]
# xywh = box_info[8:12]

However, tracklet_mAP does not improve when switching from the averaging manner to the tracker-only ("unique") manner. The reasons may be:

  1. Cases of box shifting are rarely seen, so the final performance benefits little from this fix.
  2. The averaging manner may be a more precise way to combine/choose between the two kinds of boxes in most cases, so the tracker-only manner may lose some accuracy.

About VidVRD dataset

Hello, could you provide your extracted tracklet data VidVRD_test_every1frames? Thank you very much.

About classme feature

According to tools_draft/extract_classme.py, I ran tools_draft/construct_CatName2vec.py first, but no file named vidor_CatName2vec_dict.pkl was generated. Could you help me?
Thank you!

Tracklet extraction model and details

Hello, I would like to run some ablation experiments on tracklets. Could you provide the parameters you used for the deepSORT method, as well as a link to the model used to extract the objects' deepSORT features?

Tracklet Data of VidVRD

Hello,
The link for downloading the tracklet data VidVRD_test_every1frames is invalid. Could you please provide the new one?
Thank you very much.

Mis-matching between Trajectory and gt_graph

Hi, I find that the cat_ids of a trajectory proposal differ from the traj_cat_ids of the paired gt_graph.
For example:
proposal.cat_ids: tensor([ 4, 4, 24, 4, 31, 31]), gt_graph.traj_cat_ids: tensor([11, 11, 7, 35, 35])
There are no common trajectory categories between each video's proposal and its gt_graph, but I think there should be. So I would like to know how to obtain the concrete category of a trajectory proposal or gt_graph.
Thanks!

About the prepared data for Vidor

Hello, when I try to inspect the structure of "MEGAv7_VidORtrain_freq1_part01_th_15-180-200-0.40.pkl" with pickle.load, I get the error "_pickle.UnpicklingError: pickle data was truncated". I would like to know how to use these prepared data and what their structure is. Thanks a lot.

DEBUG model

I found this snippet in grd_model_v5.py, which the DEBUG model uses to predict time boundaries. In the regression head, why are the left (right) offsets mapped to [0, 1] by a sigmoid? And how do you transform time boundaries into video frame feature sequences?

        # shared depth-wise separable conv + ReLU block
        temp = nn.Sequential(
            DepthWiseSeparableConv1d(self.dim_hidden,self.dim_hidden,3),
            nn.ReLU()
        )
        # classification head: 4 conv blocks + a final conv projecting to num_bins
        temp2 = [copy.deepcopy(temp) for _ in range(4)] \
            + [DepthWiseSeparableConv1d(self.dim_hidden,self.num_bins,3)]
        # regression head: 4 conv blocks + a final conv projecting to 2*num_bins
        # (left/right offsets per bin), squashed into [0,1] by a sigmoid
        temp3 = [copy.deepcopy(temp) for _ in range(4)] \
            + [DepthWiseSeparableConv1d(self.dim_hidden,2*self.num_bins,3),nn.Sigmoid()]

        self.cls_head = nn.Sequential(*temp2)
        self.conf_head = copy.deepcopy(self.cls_head)
        self.regr_head = nn.Sequential(*temp3)

About the prepared data

Hello, when I try to download the prepared data, I find that the VidVRD test data on the ZJU cloud needs permission, and the PKU data is no longer available. Could you please update the links? Thanks a lot.
