
soonet's Introduction

Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos


This repository is an official implementation of SOONet. SOONet is an end-to-end framework for temporal grounding in long videos. It manages to model an hours-long video with one-time network execution, alleviating the inefficiency issue caused by the sliding window pipeline.


📢 News

  • [2023.9.29] Code is released.
  • [2023.7.14] Our paper has been accepted to ICCV 2023!

🚀 Preparation

1. Install dependencies

The code requires Python. We recommend creating a new environment with conda.

conda create -n soonet python=3.8

Then install the dependencies with pip.

conda activate soonet
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
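After installation, it can help to confirm that the pinned packages are actually present in the active environment. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and it only reads package metadata through the standard library, so it runs even when PyTorch is missing.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version string, or None if absent}.

    Hypothetical helper for illustration; it is not part of SOONet.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The install command above pins torch 1.13.1+cu117 and torchvision 0.14.1+cu117.
    for name, version in installed_versions(["torch", "torchvision"]).items():
        print(f"{name}: {version or 'not installed'}")
```

If either package is reported as not installed, the `soonet` environment is probably not the active one.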

2. Download data

  • Request access to the MAD dataset from the official webpage. Note that all our experiments are conducted on MAD-v1.
  • Once the download is complete, extract the contents of the zip file and place the data in the "data/mad" directory.

3. Data preprocess

Use the following commands to convert the annotation format and extract the sentence features.

python preprocess/proc_mad_anno.py
python preprocess/encode_text_by_clip.py

The final data folder structure should look like this:

data
└───mad/
β”‚    └───annotations/
β”‚        └───MAD_train.json
β”‚        └───MAD_val.json
β”‚        └───MAD_test.json
β”‚        └───train.txt
β”‚        └───val.txt
β”‚        └───test.txt
β”‚    └───features/  
β”‚        └───CLIP_frame_features_5fps.h5
β”‚        └───CLIP_language_features_MAD_test.h5
β”‚        └───CLIP_language_sentence_features.h5
β”‚        └───CLIP_language_tokens_features.h5
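
Before starting a long training run, the layout above can be checked with a short script. This is a sketch under the assumption that the file names are exactly those listed in the tree; the function name `missing_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names copied from the folder structure listed above.
EXPECTED_FILES = [
    "annotations/MAD_train.json",
    "annotations/MAD_val.json",
    "annotations/MAD_test.json",
    "annotations/train.txt",
    "annotations/val.txt",
    "annotations/test.txt",
    "features/CLIP_frame_features_5fps.h5",
    "features/CLIP_language_features_MAD_test.h5",
    "features/CLIP_language_sentence_features.h5",
    "features/CLIP_language_tokens_features.h5",
]


def missing_files(root="data/mad"):
    """Return the expected files (relative to root) that do not exist."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files under data/mad:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Data folder looks complete.")
```

The check only tests for file presence; it does not validate the contents of the annotation or feature files.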

🔥 Experiments

Training

Run the following command to train the model on the MAD dataset:

python -m src.main --exp_path /path/to/output --config_name soonet_mad --device_id 0 --mode train

Note that a batch size of 32 consumes approximately 70 GB of GPU memory. Decreasing the batch size avoids out-of-memory errors, but it may also hurt accuracy.

Inference

Once training is finished, run the following command to perform inference on the MAD test set.

python -m src.main --exp_path /path/to/training/output --config_name soonet_mad --device_id 0 --mode test
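
Training and inference use the same entry point and differ only in `--mode` (and typically share the same `--exp_path`), so the two steps can be chained from a small driver script. The sketch below is an illustration only: `build_command` and `train_then_test` are hypothetical names, and the flags are limited to those shown in the commands above.

```python
import subprocess
import sys


def build_command(exp_path, mode, config_name="soonet_mad", device_id=0):
    """Build the argument list for `python -m src.main` with the documented flags.

    Only the two modes shown in this README are accepted here; that restriction
    is an assumption of this sketch, not a statement about the repository.
    """
    if mode not in ("train", "test"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return [
        sys.executable, "-m", "src.main",
        "--exp_path", str(exp_path),
        "--config_name", config_name,
        "--device_id", str(device_id),
        "--mode", mode,
    ]


def train_then_test(exp_path):
    """Run training, then inference on the same experiment directory."""
    for mode in ("train", "test"):
        # check=True stops the sequence if a step exits with a non-zero status.
        subprocess.run(build_command(exp_path, mode), check=True)


if __name__ == "__main__":
    train_then_test("/path/to/output")
```

The script must be started from the repository root so that `src.main` resolves as a module.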

😊 Citation

If you find this work useful in your research, please cite our paper:

@InProceedings{Pan_2023_ICCV,
    author    = {Pan, Yulin and He, Xiangteng and Gong, Biao and Lv, Yiliang and Shen, Yujun and Peng, Yuxin and Zhao, Deli},
    title     = {Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {13767-13777}
}

🙏🏻 Acknowledgement

Our code references the following projects. Many thanks to the authors.
