Awesome-Video-Captioning

A curated list of research papers on video captioning (2015 to 2020), with links to code and project websites where available.

Contents

Paper List

2015

  1. LSTM-P: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
    Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko
    NAACL, 2015.[caffe-code]

  2. LRCN: Long-term Recurrent Convolutional Networks for Visual Recognition and Description
    Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell
    CVPR, 2015.[website]

  3. S2VT: Sequence to Sequence – Video to Text
    Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
    ICCV, 2015.[caffe-code]

  4. SA: Describing Videos by Exploiting Temporal Structure
    Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville
    ICCV, 2015.[theano-code] [tf-code]

2016

  1. LSTM-E: Jointly Modeling Embedding and Translation to Bridge Video and Language
    Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui
    CVPR, 2016.

  2. HRNE: Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
    Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang
    CVPR, 2016.

  3. h-RNN: Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
    Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu
    CVPR, 2016.

  4. MSR-VTT: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
    Jun Xu, Tao Mei, Ting Yao, Yong Rui
    CVPR, 2016.[website]

  5. BiLSTM: Video Description using Bidirectional Recurrent Neural Networks
    Álvaro Peris, Marc Bolaños, Petia Radeva, Francisco Casacuberta
    ICANN, 2016.

2017

  1. DenseVidCap: Weakly Supervised Dense Video Captioning
    Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, Xiangyang Xue
    CVPR, 2017.[tf-code]

  2. LSTM-TSA: Video Captioning with Transferred Semantic Attributes
    Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei
    CVPR, 2017.

  3. SCN: Semantic Compositional Networks for Visual Captioning
    Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng
    CVPR, 2017.[theano-code]

  4. StyleNet: StyleNet: Generating Attractive Visual Captions with Styles
    Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng
    CVPR, 2017.[pytorch-code]

  5. CT-SAN: End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
    Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim
    CVPR, 2017.[tf-code]

  6. CGVS: Top-down Visual Saliency Guided by Captions
    Vasili Ramanishka, Abir Das, Jianming Zhang, Kate Saenko
    CVPR, 2017.[tf-code]

  7. HBA: Hierarchical Boundary-Aware Neural Encoder for Video Captioning
    Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
    CVPR, 2017.[pytorch-code]

  8. TDDF: Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description
    Xishan Zhang, Ke Gao, Yongdong Zhang, Dongming Zhang, Jintao Li, Qi Tian
    CVPR, 2017.

  9. GEAN: Supervising Neural Attention Models for Video Captioning by Human Gaze Data
    Youngjae Yu, Jongwook Choi, Yeonhwa Kim, Kyung Yoo, Sang-Hun Lee, Gunhee Kim
    CVPR, 2017.[tf-code]

  10. MM-Att: Attention-Based Multimodal Fusion for Video Description
    Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks
    ICCV, 2017.

  11. Tessellation: Temporal Tessellation: A Unified Approach for Video Analysis
    Dotan Kaufman, Gil Levi, Tal Hassner, Lior Wolf
    ICCV, 2017.[tf-code]

  12. MTEG: Multi-Task Video Captioning with Video and Entailment Generation
    Ramakanth Pasunuru, Mohit Bansal
    ACL, 2017.

  13. MAM-RNN: MAM-RNN: Multi-level Attention Model Based RNN for Video Captioning
    Xuelong Li, Bin Zhao, Xiaoqiang Lu
    IJCAI, 2017.

  14. hLSTMat: Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
    Jingkuan Song, Lianli Gao, Zhao Guo, Wu Liu, Dongxiang Zhang, Heng Tao Shen
    IJCAI, 2017.[theano-code]

2018

  1. Survey: Study of Video Captioning Problem
    Jiaqi Su
    cos598B, 2018.

  2. Fine-grained Video Captioning for Sports Narrative
    Huanyu Yu, Shuo Cheng, Bingbing Ni, Minsi Wang, Jian Zhang, Xiaokang Yang
    CVPR, 2018.

  3. TSA-ED: Interpretable Video Captioning via Trajectory Structured Localization
    Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, Liang Lin
    CVPR, 2018.

  4. RecNet: Reconstruction Network for Video Captioning
    Bairui Wang, Lin Ma, Wei Zhang, Wei Liu
    CVPR, 2018.[pytorch-code]

  5. M3: M3: Multimodal Memory Modelling for Video Captioning
    Junbo Wang, Wei Wang, Yan Huang, Liang Wang, Tieniu Tan
    CVPR, 2018.

  6. PickNet: Less Is More: Picking Informative Frames for Video Captioning
    Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang
    ECCV, 2018.

  7. ECO-SCN: ECO: Efficient Convolutional Network for Online Video Understanding
    Mohammadreza Zolfaghari, Kamaljeet Singh, Thomas Brox
    ECCV, 2018.[caffe-code] [pytorch-code]

  8. SibNet: SibNet: Sibling Convolutional Encoder for Video Captioning
    Sheng Liu, Zhou Ren, Junsong Yuan
    ACM MM, 2018.

  9. TubeNet: Video Captioning with Tube Features
    Bin Zhao, Xuelong Li, Xiaoqiang Lu
    IJCAI, 2018.

2019

  1. Survey: Video Description: A Survey of Methods, Datasets and Evaluation Metrics
    Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, Mubarak Shah
    ACM Computing Surveys, 2019.

  2. GRU-EVE: Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
    Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, Ajmal Mian
    CVPR, 2019.

  3. MARN: Memory-Attended Recurrent Network for Video Captioning
    Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, Yu-Wing Tai
    CVPR, 2019.

  4. OA-BTG: Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
    Junchao Zhang, Yuxin Peng
    CVPR, 2019.

  5. VATEX: VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang
    ICCV, 2019.[website]

  6. POS: Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning
    Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, Yunde Jia
    ICCV, 2019.

  7. POS-CG: Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network
    Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, Wei Liu
    ICCV, 2019.[pytorch-code]

  8. WIT: Watch It Twice: Video Captioning with a Refocused Video Encoder
    Xiangxi Shi, Jianfei Cai, Shafiq Joty, Jiuxiang Gu
    ACM MM, 2019.

  9. MGSA: Motion Guided Spatial Attention for Video Captioning
    Shaoxiang Chen and Yu-Gang Jiang
    AAAI, 2019.

  10. TDConvED: Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
    Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Hongyang Chao, Tao Mei
    AAAI, 2019.

  11. FCVC-CF&IA: Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention
    Kuncheng Fang, Lian Zhou, Cheng Jin, Yuejie Zhang, Kangnian Weng, Tao Zhang, Weiguo Fan
    AAAI, 2019.

  12. TAMoE: Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
    Xin Wang, Jiawei Wu, Da Zhang, Yu Su, William Yang Wang
    AAAI, 2019.[code]

  13. VIC: Video Interactive Captioning with Human Prompts
    Aming Wu, Yahong Han and Yi Yang
    IJCAI, 2019.[code]

2020

  1. Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
    Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles
    CVPR, 2020.

  2. SAAT: Syntax-Aware Action Targeting for Video Captioning
    Qi Zheng, Chaoyue Wang, Dacheng Tao
    CVPR, 2020.[pytorch-code]

  3. ORG-TRL: Object Relational Graph with Teacher-Recommended Learning for Video Captioning
    Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zhengjun Zha
    CVPR, 2020.

  4. PMI-CAP: Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos
    Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang
    ECCV, 2020.[pytorch-code]

  5. RMN: Learning to Discretely Compose Reasoning Module Networks for Video Captioning
    Ganchao Tan, Daqing Liu, Meng Wang and Zheng-Jun Zha
    IJCAI, 2020.[pytorch-code]

  6. SBAT: SBAT: Video Captioning with Sparse Boundary-Aware Transformer
    Tao Jin, Siyu Huang, Yingming Li, Zhongfei Zhang, Ming Chen
    IJCAI, 2020.

  7. Joint Commonsense and Relation Reasoning for Image and Video Captioning
    Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo
    AAAI, 2020.

  8. SMCG: Controllable Video Captioning with an Exemplar Sentence
    Yitian Yuan, Lin Ma, Jingwen Wang, Wenwu Zhu
    ACM MM, 2020.

  9. Poet: Poet: Product-oriented Video Captioner for E-commerce
    Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu
    ACM MM, 2020.

  10. Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning
    Botian Shi, Lei Ji, Zhendong Niu, Nan Duan, Ming Zhou, Xilin Chen
    ACM MM, 2020.

Dense-Captioning

  1. Dense-Captioning Events in Videos
    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles
    ICCV, 2017.[code] [website]

  2. End-to-End Dense Video Captioning with Masked Transformer
    Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, Caiming Xiong
    CVPR, 2018.[pytorch-code]

  3. Attend and Interact: Higher-Order Object Interactions for Video Understanding
    Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf
    CVPR, 2018.

  4. Jointly Localizing and Describing Events for Dense Video Captioning
    Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei
    CVPR, 2018.

  5. Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
    Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu
    CVPR, 2018.[tf-code]

  6. Move Forward and Tell: A Progressive Generator of Video Descriptions
    Yilei Xiong, Bo Dai, Dahua Lin
    ECCV, 2018.

  7. Adversarial Inference for Multi-sentence Video Description
    Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach
    CVPR, 2019.[pytorch-code]

  8. Dense Relational Captioning: Triple-stream Networks for Relationship-based Captioning
    Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, In So Kweon
    CVPR, 2019.[torch-code]

  9. Streamlined Dense Video Captioning
    Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han
    CVPR, 2019.

  10. Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning
    Tanzila Rahman, Bicheng Xu, Leonid Sigal
    ICCV, 2019.

  11. An Efficient Framework for Dense Video Captioning
    Maitreya Suin, A. N. Rajagopalan
    AAAI, 2020.

  12. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
    Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal
    ACL, 2020.[pytorch-code]

  13. Identity-Aware Multi-Sentence Video Description
    Jae Sung Park, Trevor Darrell, Anna Rohrbach
    ECCV, 2020.

Grounded-Captioning

  1. GVD: Grounded Video Description
    Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach
    CVPR, 2019.[pytorch-code]

  2. Relational Graph Learning for Grounded Video Description Generation
    Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haochen Shi, Jun Xiao, Yueting Zhuang, William Yang Wang
    ACM MM, 2020.
