
adv-inf's Introduction

Adversarial Inference for Multi-Sentence Video Descriptions

This is the implementation of Adversarial Inference for Multi-Sentence Video Descriptions

This repository is based on self-critical.pytorch. Thank you, Ruotian, for the code! The modifications are:

  • Training the Multimodal Generator and Hybrid Discriminator in models/.
  • Adversarial Inference in eval_utils.py.

Requirements

Clone the repository recursively:

git clone --recursive https://github.com/jamespark3922/adv-inf

  • Python 2.7 (there is no coco-caption version for Python 3)
  • PyTorch 0.4 (along with torchvision)
  • densevid_eval (for ActivityNet evaluation)
  • Java (to run the meteor.jar file)

Training on ActivityNet Dense Captions

Download the ActivityNet captions and preprocess them.

We share the input labels and features in this folder. (Scripts to preprocess the labels will be available soon.)

Features

  • resnext101-64f (126GB), extracted with the r3d repository
  • resnet152 (14GB), extracted from 100 frames per video
  • bottomup labels (16GB) with confidence scores, extracted from 3 frames per clip

After downloading them all, unzip them to your preferred feature directory.

Note that mean-pooling over frames is performed when the data is loaded in dataloader.py.
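As a rough illustration (not the repo's exact code), this pooling amounts to averaging the per-frame features over the temporal axis; the actual keys and array shapes in dataloader.py may differ:

    import numpy as np

    def load_pooled_feature(path):
        # hypothetical: features stored as a (num_frames, feat_dim) array
        feats = np.load(path)
        # mean-pool over the temporal (frame) axis -> (feat_dim,)
        return feats.mean(axis=0)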

Training

python train.py --caption_model video --input_json activity_net/inputs/video_data_dense.json --input_fc_dir activity_net/feats/resnext101-64f/ --input_img_dir activity_net/feats/resnet152/ --input_box_dir activity_net/feats/bottomup/ --input_label_h5 activity_net/inputs/video_data_dense_label.h5 --glove_npy activity_net/inputs/glove.npy --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path video_ckpt --val_videos_use -1 --losses_print_every 10 --batch_size 16 --language_eval 1

Context: the generator uses the hidden state of the previous sentence as "context", starting at epoch --g_context_epoch.
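The toy GRU sketch below illustrates the idea: each sentence's decoder is initialized with the final hidden state of the previous sentence. This is an illustration only; the actual wiring lives in models/ and differs in detail:

    import torch
    import torch.nn as nn

    rnn = nn.GRU(input_size=300, hidden_size=512, batch_first=True)
    context = torch.zeros(1, 1, 512)  # zero context for the first sentence
    for sent_emb in [torch.randn(1, 8, 300) for _ in range(3)]:  # 3 dummy sentences
        out, hidden = rnn(sent_emb, context)
        context = hidden  # carry the hidden state into the next sentence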

Evaluation

After training is done, evaluate the captions at the paragraph level. Note that the evaluation is done on the val1 set.

Standard inference using greedy (max) decoding or beam search can be run with the following command:

python eval.py --g_model_path video_ckpt/gen_best.pth --infos_path video_ckpt/infos.pkl --d_model_path video_ckpt/dis_best.pth --sample_max 1 --id $id --beam_size $beam_size

The generated captions will be saved in densevid_eval/caption_$id.json. You can also omit --d_model_path if you do not wish to score the captions with the discriminator.

Adversarial Inference

Sampling $num_samples sentences and choosing the best one with the discriminator can be run with:

python eval.py --g_model_path video_ckpt/gen_best.pth --infos_path video_ckpt/infos.pkl --d_model_path video_ckpt/dis_best.pth --sample_max 0 --num_samples $num_samples --temperature $temperature --id $id
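Schematically, adversarial inference samples several candidate captions from the generator and keeps the one the discriminator scores highest. The names generator.sample and discriminator.score below are placeholders; the real logic is in eval_utils.py:

    import torch

    def adversarial_inference(generator, discriminator, feats,
                              num_samples, temperature):
        # sample_max 0: draw num_samples candidates by temperature sampling
        candidates = [generator.sample(feats, temperature=temperature)
                      for _ in range(num_samples)]
        # score each candidate with the hybrid discriminator
        scores = torch.tensor([discriminator.score(feats, c) for c in candidates])
        # keep the best-scored caption
        return candidates[scores.argmax().item()]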

Generated Captions

You can run the language metrics to reproduce the results:

python para-evaluate.py -s $submission_file --verbose

and the diversity metrics (Div-N, Re-N) reported in the paper:

python evaluateCaptionsDiversity.py $submission_file
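As a rough sketch, a distinct-n-gram diversity score of the Div-N flavor can be computed as the ratio of unique n-grams to all n-grams in a paragraph. This is an assumed definition for illustration only; the exact Div-N / Re-N definitions used in the paper are in evaluateCaptionsDiversity.py:

    def div_n(sentences, n):
        # assumed definition: unique n-grams / total n-grams in the paragraph
        ngrams = []
        for sent in sentences:
            tokens = sent.lower().split()
            ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return float(len(set(ngrams))) / max(len(ngrams), 1)

    print(div_n(["a man is seen speaking", "a man then runs down a hill"], 2))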

Reference

@article{park2019advinf,
  title={Adversarial Inference for Multi-Sentence Video Descriptions},
  author={Park, Jae Sung and Rohrbach, Marcus and Darrell, Trevor and Rohrbach, Anna},
  journal={CVPR},
  year={2019}
}


adv-inf's Issues

Code for generating features

Hi, Park. Excellent work! Could you please post the code for generating the three kinds of features, i.e. 'resnext101-64f (126GB)', 'resnet152 (14GB)', and 'bottomup labels (16GB)'?

generated captions

Hi @jamespark3922,

Could you share the generated captions of the models used in the paper, including the ones from references [13, 76, 67]?

Best,
Jie

Unable to understand the feature extraction part

Hi, I am not able to understand how you extracted the resnext101-64f features. I am trying to extract features using the repository linked here (https://github.com/kenshohara/video-classification-3d-cnn-pytorch), but for every 16 frames it gives me a feature vector of 2048 dimensions, while your vectors seem to have 4096 dimensions per 16 frames.

For resnet152 you used 100 frames per video. Can you please explain how you extracted those 100 frames? Also, resnet152 has a feature size of 2048, but your vectors have a dimension of 1024.

It would be very helpful if you could help me with this.

result_val_D7B9L9.json

[Errno 2] No such file or directory: 'densevid_eval/result_val_D7B9L9.json'

How can I obtain this file?

Normal GAN training

Does this repo contain the option to train in normal adversarial mode? If not, may I kindly request that you share it?

GAN

Hi @jamespark3922 ,

Are you considering releasing the code for training it as a GAN? It would be a very interesting code base, and very complementary to Luo's. In case you don't plan to release it, maybe you could share it with me via e-mail?

Regards,
panos
