txt2video

Generating videos conditioned on text with GANs (honours thesis). This repository contains implementations of the following papers:

  1. To Create What You Tell
  2. TGAN
  3. TGANv2

The last two are modified to condition on text. Text is encoded with a Bi-LSTM that has been pretrained to generate the next token, which from memory is the same methodology as "To Create What You Tell".

Additionally, to capture motion in the discriminator more effectively, non-local blocks (self-attention) are used.
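For reference, the generic non-local operation from the non-local neural networks paper can be written as below. This is only the general form; which pairwise function $f$ and embedding $g$ this implementation uses is not stated here, so treat the symbols as placeholders.

$$
y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad z_i = W_z\, y_i + x_i
$$

Here $i$ and $j$ index positions across space and time, $\mathcal{C}(x)$ is a normalisation factor, and the residual term $+ x_i$ lets the block be added to an existing network without changing its initial behaviour.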

Conditional information is introduced in a manner similar to StackGAN++. Relativistic losses are used.

For the discriminator, we compare the pairs:

  • $\{(x_r, c_r), (x_f, c_r)\}$
  • $\{(x_r, c_f), (x_f, c_r)\}$

For the generator, we only compare the first pair above.

  • $x_r$ is a real video
  • $x_f$ is a fake (generated) video
  • $c_r$ is a caption correctly associated with the video
  • $c_f$ is a caption not associated with the video

The standard GAN loss is preferred because training uses one discriminator step per generator step.
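As a rough illustration of the first pair, here is a minimal sketch of a relativistic loss in the standard-GAN (log-sigmoid) form, written on plain floats. It is not taken from the repository: the function names are hypothetical, and the scores stand for whatever the discriminator outputs for a (video, caption) input.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def relativistic_loss(score_a: float, score_b: float) -> float:
    """-log(sigmoid(score_a - score_b)); small when score_a exceeds score_b."""
    return softplus(score_b - score_a)


def discriminator_loss(d_real_cr: float, d_fake_cr: float) -> float:
    # Assumed form for the first pair: the real video should score higher
    # than the fake one when both are paired with the matching caption c_r.
    return relativistic_loss(d_real_cr, d_fake_cr)


def generator_loss(d_real_cr: float, d_fake_cr: float) -> float:
    # Same pair with the roles swapped: the generator wants the fake video
    # to score higher than the real one.
    return relativistic_loss(d_fake_cr, d_real_cr)
```

The second discriminator pair is left out of the sketch because the text above does not say which element of that pair should score higher.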

Alternatively, I did experiment with a non-relativistic loss, via the following intuition:

  • $(x_r, c_r)$ => [should be associated]
  • $(x_f, c_r)$ => [should not be]
  • $(x_r, c_f)$ => [should not be]
  • $(x_f, c_f)$ => not used

The last could optionally be used to learn but doesn't seem to be necessary (at least empirically).
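The intuition above can be sketched as a binary cross-entropy over the three pairs that are used. This is a minimal sketch on plain floats, not the repository's code: the function names are hypothetical, and summing the three terms with equal weight is an assumption.

```python
import math


def bce_with_logit(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def matching_d_loss(
    logit_real_cr: float, logit_fake_cr: float, logit_real_cf: float
) -> float:
    # (x_r, c_r) should be associated       -> target 1
    # (x_f, c_r) should not be associated   -> target 0
    # (x_r, c_f) should not be associated   -> target 0
    # (x_f, c_f) is not used.
    # Assumption: the three terms are summed with equal weight.
    return (
        bce_with_logit(logit_real_cr, 1.0)
        + bce_with_logit(logit_fake_cr, 0.0)
        + bce_with_logit(logit_real_cf, 0.0)
    )
```

With this labelling, the discriminator is pushed to reject both unrealistic videos and realistic videos paired with the wrong caption.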

Datasets & Results

Three datasets are used.

  1. Synthetic MNIST for moving digits
  2. MSR Video to Text (MSRVDC) dataset
  3. Custom dataset with videos scraped from reddit

Synthetic MNIST

MNIST-based data generated by txt2vid/data/synthetic/generate.py

TCWYT Baseline:

From top to bottom:

'<start> digit 9 is left and right<end>' 1.jpg

'<start> digit 8 is right and left<end>' 2.jpg

'<start> digit 8 is bottom and top<end>' 3.jpg

'<start> digit 4 is top and bottom<end>' 4.jpg

MSRVDC

TCWYT (Conditional)

In both examples below, the bottom row is the ground truth.

'<start> a woman is saying about how to make vegetable tofu <unk> <end>'

tcwyt_1

'<start> the person is cooking <end>'

tcwyt_2

TGANv2 (Unconditional)

1_uncond 2_uncond 3_uncond

TGANv2 (Conditional) + My Modifications

'<start> the man poured preserves over the chicken<end>' 1_cond

'<start> a person is dicing and onion<end>' 2_cond

'<start> a woman is peeling a large shrimp in a glass bowl of water<end>' 3_cond

reddit-videos

See https://github.com/miguelmartin75/reddit-videos

Didn't end up training on this dataset :/

Details & References

Please see thesis.pdf for more details, references, etc.
