License: MIT License

Self-Contained-Video-Entity-Discovery

This is the official implementation and benchmark of the "Self-Contained Entity Discovery in Captioned Videos" paper.

Video Entity Discovery Task Definition:

Is it possible to uncover the visual entities occurring in a collection of captioned videos without requiring task-specific supervision or external knowledge sources? We introduce self-contained video entity discovery, where we seek to localize visual entities and align them with named entities using only the corresponding captions. The left side of the figure below shows the model input: video frames and their corresponding captions. The right side shows the expected output: the localized visual entities, each labeled with its correct named entity.

[Figure: problem definition]

Method:

We propose a three-stage approach to tackle this problem:

Stage 1: Bipartite entity-name alignment.

First, mentions of named entities are extracted from the textual descriptions, and visual entity boxes are extracted from the corresponding frames. For each description-frame pair, a set of visual entity nodes and named entity nodes is created and densely connected, forming one bipartite graph per pair. These graphs provide initial cues about which entity names correspond to which boxes. However, this alignment is non-unique, incomplete, and noisy. The labels (i.e., names) assigned to each box after this stage are called weak labels.
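The dense connection described above can be illustrated with a minimal sketch. It is not taken from the repository: the function names, box ids, and entity names are hypothetical, and it assumes that name extraction and box detection have already been run.

```python
from itertools import product


def build_bipartite_graphs(pairs):
    """Build one densely connected bipartite graph per description-frame pair.

    pairs: list of (box_ids, names) tuples, one per description-frame pair.
    Returns a list of edge lists; each edge is a (box_id, name) tuple.
    """
    graphs = []
    for box_ids, names in pairs:
        # Dense connection: every box is linked to every name in the caption.
        graphs.append(list(product(box_ids, names)))
    return graphs


def weak_labels(graphs):
    """Collect, for each box, the set of names it is connected to."""
    labels = {}
    for edges in graphs:
        for box_id, name in edges:
            labels.setdefault(box_id, set()).add(name)
    return labels


# Hypothetical toy input: three description-frame pairs.
pairs = [
    (["f1_b0", "f1_b1"], ["Alice", "Bob"]),  # two boxes, two names: non-unique
    (["f2_b0"], ["Alice"]),                  # one box, one name
    (["f3_b0"], []),                         # caption mentions no name: incomplete
]
graphs = build_bipartite_graphs(pairs)
labels = weak_labels(graphs)
```

In this toy input, both boxes of the first frame receive both names, and the box of the third frame receives no weak label at all, which mirrors the non-unique and incomplete nature of the alignment.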

[Figure: method]

Stage 2: Inter-box entity agreement.

Second, we seek to improve the matching between visual entities from frames and named entities from textual descriptions by taking advantage of the visual similarities of box entities. This is done by over-clustering the visual embeddings and then using the alignments from the bipartite graphs of the previous stage to aggregate clusters into entity distributions. The most frequently occurring entity name in a box's cluster is then selected as the new name for that box; this is termed the cleansed label.
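The aggregation step of this stage might look roughly like the sketch below. It covers only the name voting, and assumes that cluster ids have already been obtained by some over-clustering of the visual embeddings (for example, a clustering run with more clusters than expected entities). The function name and the toy data are hypothetical and not part of the repository.

```python
from collections import Counter, defaultdict


def cleanse_labels(cluster_of, weak):
    """Assign each box the most frequent weak name within its cluster.

    cluster_of: dict mapping box_id -> cluster id (from over-clustering).
    weak: dict mapping box_id -> set of weak names from Stage 1.
    Returns a dict mapping box_id -> cleansed name.
    """
    # Per-cluster entity distribution: how often each name occurs in a cluster.
    counts = defaultdict(Counter)
    for box_id, cluster in cluster_of.items():
        counts[cluster].update(weak.get(box_id, ()))

    cleansed = {}
    for box_id, cluster in cluster_of.items():
        if counts[cluster]:  # skip clusters in which no box has a weak name
            cleansed[box_id] = counts[cluster].most_common(1)[0][0]
    return cleansed


# Hypothetical toy input: five boxes in two clusters.
cluster_of = {"b0": 0, "b1": 0, "b2": 0, "b3": 1, "b4": 1}
weak = {
    "b0": {"Alice", "Bob"},
    "b1": {"Alice"},
    "b2": set(),            # no weak label after Stage 1
    "b3": {"Bob"},
    "b4": {"Alice", "Bob"},
}
cleansed = cleanse_labels(cluster_of, weak)
```

Here box `b2` had no weak label but inherits the name of its cluster, and the ambiguous boxes `b0` and `b4` each end up with a single name.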

Stage 3: Prototypical entity refinement.

Because each cluster is assigned its most frequently occurring entity name, the most frequent entity name becomes over-represented in the entity distribution, biasing entity discovery towards the most frequent labels. As a third stage, we address this bias by computing a per-entity prototype and then refining the entity assignment for boxes of the most frequent prototype based on the minimal prototypical distance.
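A minimal sketch of this refinement is given below, under several assumptions that the text above does not fix: each prototype is taken to be the mean embedding of the boxes carrying that name, distance is Euclidean, and only boxes currently labeled with the most frequent name are reassigned. The function names and toy embeddings are hypothetical.

```python
import math
from collections import Counter


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refine(embeddings, labels):
    """Reassign boxes of the most frequent entity to their nearest prototype.

    embeddings: dict mapping box_id -> list of floats.
    labels: dict mapping box_id -> cleansed name from Stage 2.
    Returns a dict mapping box_id -> refined name.
    """
    # Per-entity prototype (assumption: mean embedding of the entity's boxes).
    by_name = {}
    for box_id, name in labels.items():
        by_name.setdefault(name, []).append(embeddings[box_id])
    prototypes = {name: mean(vecs) for name, vecs in by_name.items()}

    most_frequent = Counter(labels.values()).most_common(1)[0][0]

    refined = dict(labels)
    for box_id, name in labels.items():
        if name == most_frequent:
            # Minimal prototypical distance (assumption: Euclidean).
            refined[box_id] = min(
                prototypes,
                key=lambda n: math.dist(embeddings[box_id], prototypes[n]),
            )
    return refined


# Hypothetical toy input: "b2" sits near "b3" but carries the majority name.
embeddings = {
    "b0": [0.0, 0.0],
    "b1": [0.2, 0.0],
    "b2": [4.0, 4.0],
    "b3": [5.0, 5.0],
}
labels = {"b0": "Alice", "b1": "Alice", "b2": "Alice", "b3": "Bob"}
refined = refine(embeddings, labels)
```

In this toy case, `b2` is closer to the `Bob` prototype than to the `Alice` prototype, so it moves away from the over-represented name while `b0` and `b1` keep theirs.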

How to use?

Further explanation will be added soon.

Some Qualitative Examples:

[Figures: with limsi general qualitative]

Scene Entity Discovery:

[Figure: scene entities]

Citation

Please consider citing this work using this BibTeX entry:

@misc{https://doi.org/10.48550/arxiv.2208.06662,
  doi = {10.48550/ARXIV.2208.06662},
  url = {https://arxiv.org/abs/2208.06662},
  author = {Ayoughi, Melika and Mettes, Pascal and Groth, Paul},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {Self-Contained Entity Discovery from Captioned Videos},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
