Papers-Reading

My reading notes on deep learning papers, along with my personal comments on each paper. There may be many mistakes, and I would really appreciate it if you pointed them out.

Neural Style Transfer

Neural Style Transfer: A Review ⭐⭐⭐⭐
- Surveys the work on Neural Style Transfer up to May 2016.
Demystifying Neural Style Transfer
- Proves that matching the Gram matrices is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with a second-order polynomial kernel.
- Tries out different kernels and parameters.
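
The equivalence can be checked numerically. This is a minimal sketch, not the paper's code: it assumes a feature map F with N channels and M positions, treats each position as one N-dimensional sample, uses the biased MMD estimate, and drops the normalization constant of the style loss. Under those assumptions the squared Frobenius distance between the Gram matrices equals M^2 times the squared MMD with kernel k(x, y) = (x . y)^2. All function names are made up for illustration.

```python
import random

def gram(features):
    """Gram matrix G = F F^T for F given as N channel rows of length M."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
            for row_i in features]

def gram_loss(generated_feats, style_feats):
    """Squared Frobenius distance between the two Gram matrices (unnormalized)."""
    g, a = gram(generated_feats), gram(style_feats)
    n = len(g)
    return sum((g[i][j] - a[i][j]) ** 2 for i in range(n) for j in range(n))

def poly2_kernel(x, y):
    """Second-order polynomial kernel k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def mmd2(xs, ys):
    """Biased estimate of the squared MMD between two equally sized samples."""
    m = len(xs)
    k_xx = sum(poly2_kernel(a, b) for a in xs for b in xs)
    k_yy = sum(poly2_kernel(a, b) for a in ys for b in ys)
    k_xy = sum(poly2_kernel(a, b) for a in xs for b in ys)
    return (k_xx + k_yy - 2 * k_xy) / m ** 2

def columns(features):
    """View an N x M feature map as M samples, one N-dim vector per position."""
    return [list(col) for col in zip(*features)]

random.seed(0)
N, M = 3, 5  # toy sizes: 3 channels, 5 spatial positions
F = [[random.gauss(0, 1) for _ in range(M)] for _ in range(N)]
S = [[random.gauss(0, 1) for _ in range(M)] for _ in range(N)]

lhs = gram_loss(F, S)
rhs = M ** 2 * mmd2(columns(F), columns(S))
print(lhs, rhs)  # the two numbers agree up to floating-point error
```

Swapping `poly2_kernel` for another kernel gives the kind of variant the second bullet refers to, although the Gram-matrix identity then no longer holds.
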
Fast Patch-based Style Transfer of Arbitrary Style
- A more advanced version of "Fast" Neural Style Transfer that runs in real time and applies to arbitrary styles.
- The drawback is that the stylized images are of lower quality than those of "Fast" Neural Style, which in turn only applies to a finite set of styles.

Generative Model

VAE

Tutorial on Variational Autoencoders⭐⭐⭐⭐⭐

GAN for Image

Self-Attention Generative Adversarial Networks(!!Important)⭐⭐⭐⭐⭐
- Self-Attention GAN, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset.
- Uses self-attention to learn long-range dependencies.
- Several tricks inside:
  - Uses spectral normalization on both the generator and the discriminator, which proved to be more stable in training than SN-GAN.
  - Shows that the two-timescale update rule (TTUR) is an effective way to speed up convergence.
  - Indicates that the self-attention mechanism at middle-to-high level feature maps (e.g., feat32 and feat64) achieves better performance than at low level feature maps. The reason could be that the network receives more evidence with larger feature maps and enjoys more freedom to choose the conditions.
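
To make the self-attention idea concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the SAGAN implementation: the feature map is flattened into a list of position vectors, and the learned query/key/value projections are replaced by the identity. The scalar `gamma` scales the attended features before they are added back to the input, so `gamma = 0` leaves the input unchanged.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(x, gamma=0.0):
    """x is a list of position feature vectors; returns gamma * attention(x) + x.

    Assumption: identity projections stand in for the learned 1x1 convolutions.
    """
    out = []
    for query in x:
        # Every position attends to every other position, however far away.
        weights = softmax([dot(query, key) for key in x])
        attended = [sum(w * value[c] for w, value in zip(weights, x))
                    for c in range(len(query))]
        out.append([gamma * a + q for a, q in zip(attended, query)])
    return out

positions = [[1.0, 0.0], [0.0, 1.0]]
print(self_attention(positions, gamma=0.0))  # identical to the input
print(self_attention(positions, gamma=1.0))
```

Because the attention weights are computed between all pairs of positions, the cost grows quadratically with the number of positions.
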
Conditional Generative Adversarial Nets⭐⭐⭐⭐
- cGAN: you can embed extra information to control the generated result.
- The information is fed into both the generator and the discriminator. This can be done by concatenating z (after an fc layer) with the label y (after an fc layer).
- They experimented with MNIST generation, using the given digit as y (one-hot), and with multimodal tagging. For the tagging task, they use an image as the conditioning information by passing it through a pretrained CNN to obtain y.
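
The conditioning step can be sketched as follows. This is a simplified illustration that skips the fc layers mentioned above and concatenates the raw noise vector with a one-hot label; the helper names and the sizes (100-dim noise, 10 classes, 784 pixels) are assumptions for the example.

```python
import random

def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def generator_input(z, label, num_classes=10):
    """Concatenate the noise z with the condition y before the generator."""
    return list(z) + one_hot(label, num_classes)

def discriminator_input(x, label, num_classes=10):
    """The discriminator sees the sample together with the same condition."""
    return list(x) + one_hot(label, num_classes)

random.seed(0)
z = [random.uniform(-1, 1) for _ in range(100)]
g_in = generator_input(z, label=7)
d_in = discriminator_input([0.0] * 784, label=7)
print(len(g_in), len(d_in))
```
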
Wasserstein GAN

GAN for Text&Audio generation

SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
Synthesizing Audio with Generative Adversarial Networks⭐⭐⭐⭐
- The first listenable GAN-based audio generation work.
- Uses several methods, as below:
  - 1D conv (filter length 25) rather than 5x5
  - Upsampling by a factor of 4 at each layer
  - A learned post-processing filter, and phase shuffle to prevent the discriminator from learning to classify fake/real audio only by phase.
- Explores WaveGAN and SpecGAN. Though the Inception Score of SpecGAN (6.0) is higher than that of WaveGAN (4.7), humans prefer WaveGAN. (So does this mean the Inception Score criterion could be improved? Or that SpecGAN has some potential?)
- Provides SC09, an audio dataset of the spoken digits 0-9.
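
A rough sketch of phase shuffle on a single channel, in plain Python. It assumes the operation shifts the activations by a random number of samples in [-radius, radius] and fills the gap by reflection while keeping the length fixed; the function names are hypothetical and this is not the authors' code.

```python
import random

def reflect_shift(signal, shift):
    """Shift a 1-D sequence by `shift` samples, filling the gap by reflection.

    The edge sample is not repeated, and the output keeps the input length.
    Requires abs(shift) < len(signal).
    """
    n = len(signal)
    if shift == 0:
        return list(signal)
    if shift > 0:
        # Delay: mirror the first samples onto the left, drop the tail.
        pad = [signal[i] for i in range(shift, 0, -1)]
        return pad + list(signal[:n - shift])
    k = -shift
    # Advance: drop the head, mirror the last samples onto the right.
    pad = [signal[n - 2 - i] for i in range(k)]
    return list(signal[k:]) + pad

def phase_shuffle(signal, radius, rng=random):
    """Apply a random shift drawn uniformly from [-radius, radius]."""
    return reflect_shift(signal, rng.randint(-radius, radius))

print(reflect_shift([0, 1, 2, 3, 4], 2))
print(reflect_shift([0, 1, 2, 3, 4], -2))
```
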
C-RNN-GAN: Continuous recurrent neural networks with adversarial training⭐⭐⭐⭐
- LSTM-based generator and discriminator, trained on a dataset of classical music in MIDI format.
- Applies tricks such as curriculum learning (gradually increasing the sequence length), freezing (controlling the capability of G and D) and feature matching (I don't understand this part...).
- Evaluation: Polyphony, Scale consistency, Repetitions, Tone span.
Semi-Recurrent CNN-based VAE-GAN for Sequential Data Generation(ICASSP 2018)
A Note on the Inception Score(ICML 2018 Workshop)
MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment(AAAI 2018)
MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation(ISMIR’17)
Language Generation with Recurrent Generative Adversarial Networks without Pre-training(ICML 2017 Workshop)

Attention

Attention Is All You Need
Neural Machine Translation by Jointly Learning to Align and Translate
- The first paper that proposed Attention.
Effective Approaches to Attention-based Neural Machine Translation
- Proposed global and local attention.

Speech

WaveNet

Pixel Recurrent Neural Networks(Best Paper of ICML2016) ⭐⭐⭐⭐
- I quickly skimmed this paper. It introduces a new method to generate an image pixel by pixel with a sequence model, which means the current pixel can only be predicted from its previous pixels (namely the pixels above and to the left of it). To achieve this, they introduce a mask to make sure the model cannot read later pixels.
- The loss curve is much smoother and more interpretable compared to GAN.
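
The mask can be illustrated with a small helper. This sketch builds a spatial mask for a square convolution kernel and ignores the ordering between color channels; `include_center` distinguishes a first layer that must not see the current pixel from later layers that may see their own position. The function name is an assumption for the example.

```python
def causal_mask(kernel_size, include_center):
    """kernel_size x kernel_size mask: 1 where the filter may look, 0 elsewhere."""
    c = kernel_size // 2
    mask = []
    for row in range(kernel_size):
        mask_row = []
        for col in range(kernel_size):
            above = row < c                    # any pixel in a row above
            left = row == c and col < c        # same row, to the left
            center = row == c and col == c     # the current pixel itself
            visible = above or left or (center and include_center)
            mask_row.append(1 if visible else 0)
        mask.append(mask_row)
    return mask

for line in causal_mask(3, include_center=False):
    print(line)
```

Multiplying the convolution weights elementwise by this mask zeroes out every connection to pixels that come later in raster-scan order.
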
Conditional Image Generation with PixelCNN Decoders ⭐⭐⭐⭐⭐
- An improvement on PixelRNN & PixelCNN that adds a gated activation unit.
- Uses two stacks (vertical and horizontal) to avoid the blind spot in the mask.
- Explores conditional image generation with this Gated PixelCNN. It actually seems not as good as GAN, but it is still another method, and it led to the famous WaveNet.
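
As a small illustration of the gated activation unit, the sketch below combines a tanh "filter" branch with a sigmoid "gate" branch elementwise. It assumes the two branches have already been computed by separate convolutions, which are left out here.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_activation(filter_out, gate_out):
    """Elementwise tanh(filter) * sigmoid(gate)."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]

# A strongly negative gate shuts a unit off; a strongly positive one passes it.
print(gated_activation([2.0, 2.0], [-10.0, 10.0]))
```
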
WaveNet: A Generative Model for Raw Audio ⭐⭐⭐⭐⭐
- A summary of the papers above, applying these methods to audio.
- Keywords: combines the techniques of dilated causal convolution, gated activation units and residual networks along with skip connections.
- Based on the conditional WaveNet, they ran experiments on multi-speaker speech generation, TTS (Text-To-Speech) and music generation by feeding an additional input h. In speech generation, h is a one-hot vector of the speaker ID; in TTS it is the text; in music generation it is a tag of the generated music, such as the instruments or the genre.
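
A dilated causal convolution can be sketched in a few lines of plain Python. This is a single-channel illustration with zero padding on the left, where `weights[0]` multiplies the current sample (a convention chosen for this example). The second helper computes the receptive field of a stack of such layers.

```python
def causal_dilated_conv(signal, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zeros before t = 0."""
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # never read samples from the future or before the start
                total += w * signal[idx]
        out.append(total)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(causal_dilated_conv([1, 2, 3, 4], [1, 1], dilation=2))
print(receptive_field(2, [1, 2, 4, 8]))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of weights grows only linearly.
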
Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Tacotron

Deep Voice

Deep Voice: Real-time Neural Text-to-Speech
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Neural Voice Cloning with a Few Samples
- A recent paper from Baidu on using a few samples to generate a lot of TTS audio.

Others

Towards End-to-End Speech Recognition with Deep Convolutional Neural ⭐⭐⭐
- They found it is possible to use a purely CNN-based end-to-end model for the speech recognition (SR) task, with results as good as those of RNNs.
- They treat the audio spectrogram as a 2-D input, build the model with a CONV2D + Maxout + CTC architecture, and evaluate it on the TIMIT dataset.
Neural Speech Synthesis with Transformer Network. Code:soobinseo/Transformer-TTS

Speech Conversion(Voice Style Transfer)

Papers related to my current research.

The most closely related work!

Random CNN

A Powerful Generative Model Using Random Weights for the Deep Image Representation(NIPS 2016) ⭐⭐⭐⭐
- This paper shows that an untrained network can be used for image representation. It uses random weights in a VGG architecture for inverting deep representations, texture synthesis and style transfer, and the results are comparable to those of the pretrained VGG.
- It shows we can use this to compare architectures without training them, which saves a lot of time.
Texture Synthesis Using Shallow Convolutional Networks with Random Filters
Extreme Style Machines: Using Random Neural Networks to Generate Textures
On Random Weights and Unsupervised Feature Learning(ICML 2011)

Self-Attention

Attention Is All You Need
- The first Self-Attention paper, proposed by Google.
GitHub: Deep-Expression
- A GitHub repo using only Self-Attention for TTS.
Self-Attention Generative Adversarial Networks
- Han Zhang, Ian Goodfellow.

Texture

WaveNet Based

A Wavenet for Speech Denoising.(ICASSP2018)
- An end-to-end learning method for speech denoising based on Wavenet.
A Universal Music Translation Network (FAIR, May 21, 2018) ⭐⭐⭐⭐
- Uses a WaveNet autoencoder to translate music across musical instruments, genres, and styles. All instruments share the same encoder, but each has its own decoder.
- Two major losses: one is the reconstruction loss between the decoder output and the ground truth; the other is an instrument classification loss.
- The results can be listened to on YouTube. Though the transfer results are not as good as those of human musicians for known voices, for unknown voices (like whistling) the transfer results are even better than the humans'. (Maybe because the humans are not so familiar with the melody?)
- They distance their work from style transfer, because they believe that a melody played by a piano is not similar, except for audio texture differences, to the same melody sung by a chorus.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders(Submitted on 5 Apr 2017)

VAE & GAN for Speech

van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural Discrete Representation Learning. In: NIPS (2017)
- Vector Quantised-Variational AutoEncoder (VQ-VAE)
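
The vector quantisation step can be sketched as a nearest-neighbor lookup in a codebook. This is only the lookup, written in plain Python with made-up names and a toy codebook; training the codebook and passing gradients through the lookup are not shown.

```python
def quantize(vector, codebook):
    """Return (index, codeword) of the codebook entry nearest in squared L2 distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # toy codebook with 3 entries
print(quantize([0.9, 1.2], codebook))
```

The encoder output is replaced by the selected codeword, so the latent representation is fully described by the discrete index.
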
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
- Use CycleGAN for Voice Conversion.

License

This project is licensed under the terms of the MIT license.
