howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

deep-learning natural-language-processing paper-notes

papernotes's Introduction

See Issues.

papernotes's Issues

Neural Baby Talk

Metadata

Authors: Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
Organization: Georgia Institute of Technology & Facebook AI Research
Conference: CVPR 2018
Paper: https://arxiv.org/pdf/1803.09845.pdf
Code: https://github.com/jiasenlu/NeuralBabyTalk

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Metadata

Authors: Xi Chen, Yan Duan, +3 authors Pieter Abbeel
Organization: UC Berkeley & OpenAI
Conference: NIPS 2016
Paper: https://arxiv.org/pdf/1606.03657.pdf
3rd Party Code: https://github.com/eriklindernoren/PyTorch-GAN#infogan

Abstract

InfoGAN is an information-theoretic extension of the GAN that learns disentangled representations in a completely unsupervised manner. (Related to #33)

Vanilla GAN

Objective: min_{G} max_{D} V(D, G) = E_{x ~ P_data} [log D(x)] + E_{z ~ noise} [log(1-D(G(z)))]
Problem: The GAN formulation places no restrictions on how the generator may use the input noise vector z. As a result, the generator may use the noise in a highly entangled way, so that the individual dimensions of z do not correspond to semantic features of the data.

InfoGAN

Decompose the input noise vector z into 2 parts:
- Incompressible noise z (interpret as an uncertainty of dataset that cannot be encoded to meaningful factors of variation)
- Disentangled latent code c = {c_1, c_2, ..., c_L} (Encode factors of variation of dataset)
- Note both vectors are learned in an unsupervised manner.
- Problem: The generator may ignore the latent code: P_G(x|c) = P_G(x).
- Apply regularization by maximizing mutual information: I(c; G(z,c)).
Mutual information I(X;Y):
- Measures the “amount of information” learned from knowledge of random variable Y about the other random variable X.
- I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X), where H(.) is entropy.
- I(X;Y) is the reduction of uncertainty in X when Y is observed. If X and Y are independent, then I(X;Y) = 0, because knowing one variable reveals nothing about the other.
- Given any x ∼ P_G(x), we want P_G(c|x) to have a small entropy. In other words, the information in the latent code c should not be lost in the generation process (Address the above problem).
Objective: min_{G} max_{D} V_I(D, G) = V(D, G) - λ I(c; G(z,c))
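
As a quick numeric check of the mutual-information identities above, here is a minimal sketch that computes I(X;Y) for a small discrete joint distribution. The two toy joint tables are made up for illustration and are not from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mutual_information(joint):
    """I(X;Y) in nats for a joint distribution given as a 2-D list joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Toy examples (assumed): X and Y independent vs. X fully determined by Y.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

In the dependent case H(X|Y) = 0, so I(X;Y) equals H(X); in the independent case the mutual information is zero. This is the situation InfoGAN wants to avoid for the pair (c, G(z,c)).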

Variational Mutual Information Maximization

Problem: I(c; G(z, c)) is hard to maximize directly as it requires access to the posterior P(c|x).
Obtain a lower bound of it by defining an auxiliary distribution Q(c|x) to approximate P(c|x).
TODO: Upload lower bound derivation image in my macbook & change faster-rcnn folder name (lol)
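
A sketch of the lower-bound derivation, following the standard variational argument. Here Q is the auxiliary distribution introduced above, c' denotes a sample from the true posterior P(c|x), and L_I is a name for the resulting bound.

```latex
\begin{aligned}
I(c; G(z,c)) &= H(c) - H(c \mid G(z,c)) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c \mid x)}[\log P(c' \mid x)]\big] + H(c) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[D_{KL}\big(P(\cdot \mid x)\,\|\,Q(\cdot \mid x)\big)
   + \mathbb{E}_{c' \sim P(c \mid x)}[\log Q(c' \mid x)]\big] + H(c) \\
&\geq \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c \mid x)}[\log Q(c' \mid x)]\big] + H(c) \\
&= \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)}[\log Q(c \mid x)] + H(c) \;=\; L_I(G, Q)
\end{aligned}
```

The inequality holds because the KL divergence is non-negative. The last line can be estimated by sampling c from the prior and x from the generator, so the true posterior is never needed, and the objective becomes min_{G,Q} max_{D} V(D, G) − λ L_I(G, Q).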

Semi-supervised Multitask Learning for Sequence Labeling

Metadata

Authors: Marek Rei
Organization: University of Cambridge
Conference: ACL 2017
Link: https://goo.gl/h6p29c

Strategies for Training Large Vocabulary Neural Language Models

Metadata

Authors: Wenlin Chen, David Grangier and Michael Auli
Organization: Washington University and Facebook AI Research
Conference: ACL 2016
Paper: https://arxiv.org/pdf/1512.04906.pdf

Perturbative Neural Networks

Metadata

Authors: Felix Juefei-Xu, Vishnu Naresh Boddeti and Marios Savvides
Organization: CMU
Conference: CVPR 2018
Paper: https://arxiv.org/abs/1806.01817
Code: https://github.com/juefeix/pnn.pytorch.update (Recommended to read)
Reddit discussion: https://www.reddit.com/r/MachineLearning/comments/a04qsj/d_updates_on_perturbative_neural_networks_pnn/

Synthesizing Programs for Images using Reinforced Adversarial Learning

Metadata

Authors: Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, Oriol Vinyals
Organization: DeepMind
Release Date: Arxiv 2018
Paper: https://arxiv.org/pdf/1804.01118.pdf
Video: https://youtu.be/iSyvwAwa7vk

Phrase-Based & Neural Unsupervised Machine Translation

Metadata

Authors: Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer and Marc'Aurelio Ranzato
Organization: Facebook AI Research
Release Date: 2018 on Arxiv
Link: https://arxiv.org/pdf/1804.07755.pdf

Graph R-CNN for Scene Graph Generation

Metadata

Authors: Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra and Devi Parikh.
Organization: Georgia Institute of Technology & Facebook AI Research (FAIR).
Conference: ECCV 2018
Paper: https://arxiv.org/pdf/1808.00191.pdf
Code: https://github.com/jwyang/graph-rcnn.pytorch
Poster: https://www.cc.gatech.edu/~jyang375/Jianwei_Yang_files/eccv18_poster.pdf (good summary and clear figures)

‘Lighter’ Can Still Be Dark: Modeling Comparative Color Descriptions

Metadata

Authors: Olivia Winn and Smaranda Muresan
Organization: Columbia University
Conference: ACL 2018 (best short paper)
Paper: https://aclweb.org/anthology/P18-2125
Video: https://vimeo.com/288152700
Dataset: https://bitbucket.org/o_winn/comparative_colors

Cold Fusion: Training Seq2Seq Models Together with Language Models

Metadata

Authors: Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh and Adam Coates
Organization: Baidu Research, Sunnyvale, CA, USA.
Release Date: 2017 on Arxiv
Link: https://arxiv.org/pdf/1708.06426.pdf

On the State of the Art of Evaluation in Neural Language Models

Metadata

Authors: Gábor Melis, Chris Dyer and Phil Blunsom
Organization: DeepMind and University of Oxford
Release Date: 2017 (Under review in ICLR 2018)
Link: https://arxiv.org/pdf/1707.05589.pdf

CornerNet: Detecting Objects as Paired Keypoints

Metadata

Authors: Hei Law and Jia Deng
Organization: University of Michigan
Conference: ECCV 2018
Paper: https://arxiv.org/abs/1808.01244
Code: https://github.com/princeton-vl/CornerNet
Video: https://www.youtube.com/watch?v=aJnvTT1-spc

On Accurate Evaluation of GANs for Language Generation

Metadata

Authors: Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly
Organization: Google AI
Conference: NIPS 2018
Paper: https://arxiv.org/pdf/1806.04936.pdf

Deep Reinforcement Learning with a Natural Language Action Space

Metadata

Authors: Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng and Mari Ostendorf
Organization: University of Washington and Microsoft Research
Conference: ACL 2016
Paper: https://arxiv.org/pdf/1511.04636.pdf
Game simulator: https://github.com/jvking/text-games

MaskGAN: Better Text Generation via Filling in the ______

Metadata

Authors: William Fedus, Ian Goodfellow and Andrew M. Dai
Organization: Google Brain
Conference: ICLR 2018
Paper: https://arxiv.org/pdf/1801.07736.pdf
Code: https://github.com/tensorflow/models/tree/master/research/maskgan

Adversarial Contrastive Estimation

Metadata

Authors: Avishek Joey Bose, Huan Ling, Yanshuai Cao
Organization: Borealis AI & University of Toronto
Conference: ACL 2018
Paper: http://aclweb.org/anthology/P18-1094
Blog: https://www.borealisai.com/en/blog/adversarial-contrastive-estimation-harder-better-faster-stronger/
Author's original project paper: https://joeybose.github.io//assets/active-ace.pdf (mentions background of active learning)

DVQA: Understanding Data Visualizations via Question Answering

Metadata

Authors: Kushal Kafle, Scott Cohen, +1 author Christopher Kanan
Organization: Rochester Institute of Technology & Adobe Research
Conference: CVPR 2018
Paper: https://arxiv.org/pdf/1801.08163.pdf
Code: https://github.com/kushalkafle/DVQA_dataset

DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Metadata

Authors: Irina Higgins, Arka Pal, +6 authors Alexander Lerchner
Organization: DeepMind
Conference: ICML 2017
Paper: https://arxiv.org/pdf/1707.08475.pdf

Abstract

This paper focuses on domain adaptation in RL settings, where an agent trained on a particular input distribution with a specified reward structure (source domain) is placed in a setting where the input distribution is modified but the reward structure remains largely intact (target domain). The target domain can be unknown.
This paper aims to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain. Here, a policy is considered robust if it generalizes with minimal drop in performance to the target domain without extra fine-tuning.
This paper tackles the domain adaptation problem by learning a disentangled/factorized representation of the world. Examples of such factors of variation in the world are object properties like color, scale, or position; other examples correspond to general environmental factors, such as geometry and lighting.
The proposed system, DARLA, relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment’s generative factors. Crucially, DARLA does not require target domain data to form its representations.

Framework

Formalized problem setting

The source domain D_{S} ≡ (S_{S}, A_{S}, T_{S}, R_{S}).
The target domain D_{T} ≡ (S_{T}, A_{T}, T_{T}, R_{T}).
Where S: State; A: Action; T: Transition function; R: Reward.
The domain adaptation scenario: S_{S} ≠ S_{T}; A_{S} = A_{T}; T_{S} ≈ T_{T}; R_{S} ≈ R_{T}.
For example, for a robot arm in a simulated environment and in the real world: S: raw pixels; A: the robot's actions; T: the physics of the world; R: the performance on the task.

DARLA

Three-stage pipeline:

Learning to see (the main contribution):
- Use a random policy to interact with environment to collect observations (require sufficient variability of factors and their conjunctions).
- Pretrain a β-VAE (#33) on those observations.
- However, reconstructing in pixel space has known shortcomings, which prior work addresses by reconstructing in the feature space of another neural network (e.g., a GAN or a pretrained AlexNet).
- In practice, this paper found that using a denoising autoencoder (DAE) to provide the feature space for β-VAE works best.
- In detail, they follow the masking noise of [1] with the aim for the DAE to learn a semantic representation of the input frames.
- Problem: The DAE might also suffer from the domain adaptation problem. If the semantic representation learned by the DAE doesn't transfer well from the source to the target domain, the β-VAE, which depends on the DAE, will also suffer.
- After pretraining the DAE, train the β-VAE to reconstruct in the DAE's feature space using L2 distance. The DAE remains frozen.
Learning to act: The agent learns the source policy via standard RL algorithms (DQN, A3C and Episodic Control). The parameters of the agent's encoder (which encodes raw pixels into an internal state for the decoder to predict actions) are not updated. They also compared with UNREAL.
Transfer: Since the encoder already learns the disentangled representation of the world of source domain, such a policy would then generalize well to the target domain out-of-the-box. In this stage, we simply evaluate the agent in target domain without retraining.
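
The "learning to see" loss described above can be sketched as a feature-space L2 reconstruction term plus a β-weighted KL term. This is a minimal illustration, assuming a diagonal-Gaussian posterior and a standard normal prior; `toy_dae_features` is a hypothetical stand-in for the frozen DAE encoder, not the paper's network.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def squared_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def beta_vae_dae_loss(x, x_hat, mu, log_var, dae_features, beta):
    """Loss to minimize: L2 reconstruction in the frozen DAE feature space + beta * KL."""
    recon = squared_l2(dae_features(x_hat), dae_features(x))
    return recon + beta * kl_diag_gaussian_to_standard_normal(mu, log_var)

def toy_dae_features(x):
    """Hypothetical frozen feature extractor: a fixed map with no trainable parameters."""
    return [x[0] + x[1], x[0] - x[1]]

# Perfect reconstruction with the posterior equal to the prior gives zero loss.
loss_perfect = beta_vae_dae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0],
                                 toy_dae_features, beta=4.0)
# An imperfect reconstruction and a shifted posterior are both penalized.
loss_shifted = beta_vae_dae_loss([1.0, 2.0], [1.0, 3.0], [1.0, 0.0], [0.0, 0.0],
                                 toy_dae_features, beta=4.0)
```

Because the feature extractor is frozen, only the β-VAE encoder and decoder would receive gradients in a real implementation; the value β = 4.0 is an arbitrary choice for the example.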

Reference

[1] Context Encoders: Feature Learning by Inpainting by Pathak et al. CVPR 2016.

Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning

Metadata

Authors: Tobias Domhan and Felix Hieber
Organization: Amazon
Conference: EMNLP 2017
Link: https://goo.gl/eFj9gx

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Metadata

Authors: Kexin Yi, Jiajun Wu, +3 authors, Joshua B. Tenenbaum.
Organization: Harvard, MIT, DeepMind.
Conference: NIPS 2018
Paper: https://arxiv.org/pdf/1810.02338.pdf
Code: https://github.com/kexinyi/ns-vqa
Website: http://nsvqa.csail.mit.edu/

Domain Adaptive Faster R-CNN for Object Detection in the Wild

Metadata

Authors: Yuhua Chen, Wen Li, +2 authors Luc Van Gool
Organization: Computer Vision Lab, ETH Zurich & VISICS, ESAT/PSI, KU Leuven
Conference: CVPR 2018
Paper: https://arxiv.org/pdf/1803.03243.pdf
Code: https://github.com/yuhuayc/da-faster-rcnn

Motivation

Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop.
Two levels of domain shift are tackled: (1) image-level shift and (2) instance-level shift.

Contributions

We provide a theoretical analysis of the domain shift problem for cross-domain object detection from a probabilistic perspective.
We design two domain adaptation components to alleviate the domain discrepancy at the image and instance levels, resp.
We further propose a consistency regularization to encourage the RPN to be domain-invariant.
We integrate the proposed components into the Faster R-CNN model, and the resulting system can be trained in an end-to-end manner.

H-divergence Definition

The H-divergence [1] is designed to measure the divergence between two sets of samples with different distributions.
H-divergence definition: d_H(S, T) = 2 (1 − min_{h ∈ H} (err_S(h(x)) + err_T(h(x))))
Here h(.) is a feature-level domain classifier, and err_S and err_T are its prediction errors on source and target samples. If the error of the best domain classifier is high, the two domains are hard to distinguish and are therefore close to each other, and vice versa.
To align the source and target domains, minimize the domain distance d_H(S, T), which is equivalent to maximizing the error of the best domain classifier.
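
To make the definition concrete, here is a minimal sketch that evaluates d_H(S, T) = 2 (1 − min_h (err_S + err_T)) on 1-D features. The hypothesis class (threshold classifiers in one direction) and the feature values are assumptions chosen for illustration only.

```python
def domain_errors(threshold, source_feats, target_feats):
    """Error rates of h(x) = 'target' if x > threshold else 'source'."""
    err_s = sum(x > threshold for x in source_feats) / len(source_feats)
    err_t = sum(x <= threshold for x in target_feats) / len(target_feats)
    return err_s, err_t

def h_divergence(source_feats, target_feats, thresholds):
    """2 * (1 - min over candidate classifiers of (err_S + err_T))."""
    best = min(sum(domain_errors(t, source_feats, target_feats)) for t in thresholds)
    return 2.0 * (1.0 - best)

# Well-separated features: a perfect domain classifier exists, so the divergence is maximal.
d_separated = h_divergence([0.1, 0.2, 0.3], [0.7, 0.8, 0.9], thresholds=[0.0, 0.5, 1.0])
# Identical features: no threshold beats chance, so the divergence is zero.
d_aligned = h_divergence([0.1, 0.5, 0.9], [0.1, 0.5, 0.9], thresholds=[0.0, 0.3, 0.7, 1.0])
```

Aligned features leave the best domain classifier at chance level, which is the state that adversarial training of the feature extractor tries to reach.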

Covariate Shift Definition

Notation: x: input; y: output; S: source domain; T: target domain; P: probability distribution.
P_{S}(x) ≠ P_{T}(x)
P_{S}(y|x) = P_{T}(y|x)

Domain Adaptation Setting

Source images and labels are available.
Only target images are available.
Our task is to learn an object detection model adapted to the unlabeled target domain.
The setting is under the covariate shift assumption, where:
- Notation: C: class of the object; B: bounding box of an object;
- P_{S} (C, B, I) ≠ P_{T} (C, B, I)
- P_{S} (C, B| I) = P_{T} (C, B| I)
Image-level adaptation:
- By Bayes' rule: P(C, B, I) = P(C, B | I) × P(I).
- Image-level domain shift is caused by: P_{S} (I) ≠ P_{T} (I).
- Given an image, the detection results should be the same regardless of which domain the image belongs to.
Instance-level adaptation:
- Again by Bayes' rule: P(C, B, I) = P(C | B, I) × P(B, I).
- Instance-level domain shift is caused by: P_{S} (B, I) ≠ P_{T} (B, I).
- Given the same image region containing an object, its category labels should be the same regardless of which domain it comes from.
Joint adaptation:
- Consider P(B, I) = P(B|I) x P(I)
- P(B|I) is assumed to be the same under covariate shift assumption.
- Thus if P_{S} (I) = P_{T} (I), we have P_{S} (B, I) = P_{T} (B, I)
- In other words, if the distributions of the image-level representations are identical for two domains, the distributions of the instance-level representations are also identical.
- Yet, it is generally non-trivial to perfectly estimate the conditional distribution P(B|I), since:
  - In practice it may be hard to perfectly align the marginal distributions P(I), which means the input for estimating P(B|I) is somehow biased.
  - The bounding box annotation is only available for source domain training data, therefore P(B|I) is learned using the source domain data only, which is easily biased toward the source domain.

Method

We propose to perform domain distribution alignment on both the image and instance levels, and to apply a consistency regularization to alleviate the bias in estimating P(B|I).

To align the source and target domains, train a domain classifier at each level, giving two domain classifiers:
- Notation: D denotes domain label.
- Image-level domain classifier: P(D|I)
- Instance-level domain classifier: P(D|B, I)
By Bayes’ theorem: P(D|B, I) P(B|I) = P(B|D, I) P(D|I).
By enforcing the consistency between two domain classifiers, i.e., P(D|B, I) = P(D|I), we could learn P(B|D, I) to approach P(B|I).

Learning Disentangled Joint Continuous and Discrete Representations

Metadata

Authors: Emilien Dupont
Organization: Schlumberger Software Technology Innovation Center
Conference: NIPS 2018
Paper: https://arxiv.org/pdf/1804.00104.pdf
Code: https://github.com/Schlumberger/joint-vae

Note: Related to #39 #33

The Natural Language Decathlon: Multitask Learning as Question Answering

Metadata

Authors: Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher
Organization: Salesforce Research
Publish Date: 2018.06
Paper: https://arxiv.org/pdf/1806.08730.pdf
Code: https://github.com/salesforce/decaNLP
Blog: https://einstein.ai/research/blog/the-natural-language-decathlon
Video: https://www.youtube.com/watch?v=MENYCdm1eis
Website: http://decanlp.com/

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Metadata

Authors: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
Organization: Microsoft Research
Conference: NIPS 2015
Paper: https://arxiv.org/pdf/1506.01497.pdf

Faster R-CNN

Faster R-CNN with VGG-16

Two modules:

Region Proposal Networks: An FCN (fully convolutional network) that proposes regions. (Serves as the "attention" for Fast R-CNN)
Fast R-CNN [1]: A classifier that uses proposed regions.

Region Proposal Networks (RPN)

Input: An image of any size.
Output: A set of rectangular object proposals, each with an objectness score.
Slide a small network over the last conv. feature map.
- Input: n x n spatial window of conv. feature map. (n = 3 in this paper)
- Each spatial window is projected to feature vector (512-d for VGG-16), then fed into two sibling FCs, a box-regression layer (reg) and a box classification layer (cls).
- This architecture is naturally implemented with an n × n conv. layer followed by two sibling 1 × 1 conv. layers (for reg and cls, respectively).
- For each spatial (sliding) window, multiple region proposals (boxes) are predicted simultaneously, where the maximum number of possible proposals for each location is denoted as k.
  - The reg layer outputs 4k values (the coordinates of k boxes).
  - The cls layer outputs 2k scores (the probability of being an object or not for each of the k boxes).
  - The k proposals are parameterized relative to k reference boxes, which we call anchors.
  - Anchor: An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio. This paper uses 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a conv. feature map of size W × H (typically ∼2,400 positions), there are W × H × k anchors in total.
RPN is translation-invariant: if an object is translated in the image, the proposal translates with it and the same function predicts it.
- The output layer is a (4 + 2) × 9-d conv. layer in the case of k = 9 anchors.
- Counting the feature projection layer as well, the proposal layers have 3 × 3 × 512 × 512 + 512 × (4 + 2) × 9 ≈ 2.4 × 10^6 parameters.
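
The parameter count above can be checked with a few lines of arithmetic. This sketch counts weights only and ignores biases, which is an assumption made to match the formula in the notes.

```python
n, channels, k = 3, 512, 9  # window size, VGG-16 feature depth, anchors per location

# n x n conv layer that projects each window to a 512-d feature vector.
feature_params = n * n * channels * channels
# Two sibling 1 x 1 conv layers: reg outputs 4k values, cls outputs 2k scores.
head_params = channels * (4 + 2) * k
total_params = feature_params + head_params
```

The total is 2,386,944, which rounds to the 2.4 × 10^6 quoted above; almost all of it comes from the 3 × 3 projection layer.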
Multi-scale anchors as regression references:
- A pyramid of anchors: Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios.
- It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size.
- The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
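
A minimal sketch of how a pyramid of anchors could be generated. The scale values (box areas of 128², 256² and 512² pixels), the aspect ratios (1:2, 1:1, 2:1) and the stride of 16 are the defaults I recall from the paper for VGG-16; the function names and the choice of anchor centers are assumptions for illustration.

```python
import math

def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x, y, w, h) centered at one position.

    Each anchor has area scale**2 and aspect ratio h / w = ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((center_x, center_y, w, h))
    return anchors

def all_anchors(feat_w, feat_h, stride=16):
    """Anchors for every sliding-window position of a feat_w x feat_h feature map."""
    return [anchor
            for j in range(feat_h)
            for i in range(feat_w)
            for anchor in make_anchors(i * stride, j * stride)]

anchors_at_origin = make_anchors(0, 0)
grid = all_anchors(60, 40)  # W x H = 2,400 positions
```

With a 60 × 40 feature map this yields 60 × 40 × 9 = 21,600 anchors, all computed from a single-scale feature map and a single filter size.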
Loss function:
- Binary label (being an object or not) for each anchor.
- Positive label:
  - The anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box.
  - An anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
  - Note that a single ground-truth box may assign positive labels to multiple anchors.
- Negative label: The anchor's IoU ratio is lower than 0.3 for all ground-truth boxes.
- Anchors that are neither positive nor negative do not contribute to the training objective.
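- Minimize a multi-task loss, following Fast R-CNN [1]: L({p_{i}}, {t_{i}}) = (1/N_{cls}) Σ_{i} L_{cls}(p_{i}, p*_{i}) + λ (1/N_{reg}) Σ_{i} p*_{i} L_{reg}(t_{i}, t*_{i})
- i: index of an anchor in a mini-batch; p_{i}: predicted probability of anchor i being an object; p*_{i}: ground-truth label (1 if the anchor is positive, 0 if it is negative); t_{i}: a vector representing the 4 parameterized coordinates of the predicted bounding box; t*_{i}: the same parameterization of the ground-truth box associated with a positive anchor.
- L_{cls}: log loss over two classes (object vs. not object); L_{reg}: smooth L1 loss, active only for positive anchors (p*_{i} = 1).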
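
The labeling rules above can be sketched as follows. This is an illustrative implementation under stated assumptions: boxes are given as (x1, y1, x2, y2) corners, the label -1 marks anchors that do not contribute to training, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 (positive), 0 (negative) or -1 (ignored)."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row)
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    # The anchor(s) with the highest IoU for each ground-truth box are positive,
    # even when that IoU does not exceed pos_thresh.
    for g in range(len(gt_boxes)):
        best_for_gt = max(row[g] for row in overlaps)
        if best_for_gt > 0:
            for a, row in enumerate(overlaps):
                if row[g] == best_for_gt:
                    labels[a] = 1
    return labels

gt = [(0, 0, 10, 10)]
labels_a = label_anchors([(0, 0, 10, 10), (0, 0, 10, 6), (20, 20, 30, 30), (0, 0, 10, 5)], gt)
labels_b = label_anchors([(0, 0, 10, 6), (20, 20, 30, 30)], gt)
```

In the second call no anchor exceeds the 0.7 threshold, so the highest-IoU rule is what guarantees the ground-truth box still receives a positive anchor.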
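Parameterizations of the 4 coordinates:
- t_{x} = (x − x_{a}) / w_{a}, t_{y} = (y − y_{a}) / h_{a}, t_{w} = log(w / w_{a}), t_{h} = log(h / h_{a})
- t*_{x} = (x* − x_{a}) / w_{a}, t*_{y} = (y* − y_{a}) / h_{a}, t*_{w} = log(w* / w_{a}), t*_{h} = log(h* / h_{a})
- x, y, w, and h denote the box’s center coordinates and its width and height.
- Variables x, x_{a}, and x* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h).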
- Can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
- Bounding-box regression: The features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
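
A minimal sketch of the box parameterization and the smooth L1 loss used for regression. Boxes are (x, y, w, h) with center coordinates; the function names and the example boxes are assumptions for illustration.

```python
import math

def encode(box, anchor):
    """Regression targets (t_x, t_y, t_w, t_h) of a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Inverse of encode: recover (x, y, w, h) from targets and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

def smooth_l1(d):
    """Smooth L1 loss of a single coordinate difference d."""
    d = abs(d)
    return 0.5 * d * d if d < 1.0 else d - 0.5

anchor = (50.0, 50.0, 100.0, 100.0)
gt_box = (60.0, 40.0, 200.0, 50.0)
t_star = encode(gt_box, anchor)
recovered = decode(t_star, anchor)
```

Offsets are normalized by the anchor size and sizes are compared in log space, so the same regressor output means the same relative correction for small and large anchors.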
Training:
- Follow the "image-centric" sampling strategy [1].
- It is possible to optimize the loss over all anchors, but this would bias the model toward negative samples, since they dominate.
- Randomly sample 256 anchors in an image, with up to a 1:1 ratio of positives to negatives. Pad the mini-batch with negatives if there are fewer than 128 positive anchors.
- Adopt 4-step alternating training:
  - Step 1: Train the RPN, initialized with an ImageNet-pretrained model and fine-tuned end-to-end for the region proposal task.
  - Step 2: Train a separate detection network (also initialized with an ImageNet-pretrained model) by Fast R-CNN using the proposals generated by the step-1 RPN. At this point the two networks do not share conv. layers.
  - Step 3: Use the detector network to initialize RPN training, but fix the shared conv. layers and only fine-tune the layers unique to RPN.
  - Step 4: Finally, keeping the shared conv. layers fixed, fine-tune the unique layers of Fast R-CNN.
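
The anchor sampling rule in the training notes above (256 anchors per image, at most half of them positive, padded with negatives) can be sketched like this. The label convention (1 positive, 0 negative, -1 ignored) and the function name are assumptions for illustration.

```python
import random

def sample_anchor_minibatch(labels, batch_size=256, rng=None):
    """Pick anchor indices for one image: up to batch_size // 2 positives, the rest negatives."""
    rng = rng or random.Random(0)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    num_pos = min(len(positives), batch_size // 2)
    # Pad with negatives when there are fewer positives than half the batch.
    num_neg = min(len(negatives), batch_size - num_pos)
    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)

# Toy image: 30 positive anchors, 1000 negative anchors, 50 ignored anchors.
labels = [1] * 30 + [0] * 1000 + [-1] * 50
pos_idx, neg_idx = sample_anchor_minibatch(labels)
```

Here only 30 positives exist, so the batch is filled with 226 negatives; ignored anchors are never sampled.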
Implementation and hyperparameter details are provided in the paper.

Reference

[1] Fast R-CNN by Ross Girshick. ICCV 2015.

Zero-shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens

Metadata

Authors: Marek Rei and Anders Søgaard
Organization: University of Cambridge & University of Copenhagen
Conference: NAACL 2018
Paper: https://arxiv.org/pdf/1805.02214.pdf
Code: https://github.com/marekrei/mltagger

Notes on Theoretical and Heuristic Findings in GANs papers

Preface

In this note, I'll keep recording findings that I think are important or useful. I'll focus on the theoretical and heuristic parts of several GAN papers. This thread will be actively updated whenever I read a GAN paper! 😊

Notations:

p_{data}: Probability density/mass function of real data.
p_{g}/{d}: Probability density/mass function of generator/discriminator.
G/D: Generator/Discriminator.
z: Noise input vector to the generator.

Generative Adversarial Nets (NIPS 2014)

For G fixed, the optimal D is: D*_{G}(x) = p_{data}(x) / (p_{data}(x) + p_{g}(x)).
Global optimality: The GAN objective has a global optimum at p_{g} = p_{data} (i.e., the generator perfectly replicates the real data distribution).
Essentially, the loss function of GAN quantifies the similarity between the p_{g} and p_{data} by JS divergence (symmetric) when the discriminator is optimal.
Convergence: If G and D have enough capacity, and at each step of training, the discriminator is allowed to reach its optimum, given G, and p_{g} is updated so as to improve the criterion then p_{g} converges to p_{data}.
G must not be trained too much without updating D, in order to avoid mode collapse in G.
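
The link between the optimal discriminator and the JS divergence can be checked numerically on a finite support, where V(D*, G) = −log 4 + 2 · JSD(p_data || p_g). The two toy distributions below are made up for illustration, and expectations over G(z) are written directly as expectations over p_g.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each point of the support."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_function(d, p_data, p_g):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]."""
    return (sum(pd * math.log(dx) for pd, dx in zip(p_data, d) if pd > 0)
            + sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d) if pg > 0))

def kl(p, q):
    """KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]
v_star = value_function(optimal_discriminator(p_data, p_g), p_data, p_g)
v_at_optimum = value_function(optimal_discriminator(p_data, p_data), p_data, p_data)
```

When p_g equals p_data the JS term vanishes and the value is −log 4, the global minimum of the generator's criterion.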

NIPS 2016 Tutorial: Generative Adversarial Networks (Video version)

Note: The discussion is under the scope of vanilla GANs.
Training GANs requires finding the Nash equilibrium of a game, which is a more difficult problem than optimizing an objective function.
Simply flipping the sign of the discriminator's objective to obtain the generator's objective (i.e., having the generator maximize the discriminator's cross-entropy loss) can cause the generator's gradient to vanish when the discriminator successfully rejects generator samples with high confidence.
MLE (maximum likelihood estimation) is equivalent to minimizing KL divergence KL(p_{data} || p_{g}).
VAE (variational autoencoder) vs. GAN: VAEs maximize a lower bound on the likelihood, whereas GANs aim to generate realistic samples rather than maximize likelihood.
GANs minimize the JS divergence, which behaves similarly to minimizing the reverse KL divergence, i.e., KL(p_{g} || p_{data}). (The KL divergence is not symmetric.)
GANs do not use MLE by default, but they can do so by modifying the generator's objective function, under the assumption that the discriminator is optimal. GANs still generate realistic samples even when using MLE. (See the paper "On Distinguishability Criteria for Estimating Generative Models" by Goodfellow. ICLR 2015. Also see the video at 55:00). Thus, the choice of the divergence (KL vs. reverse KL) cannot explain why GANs can generate realistic samples.
Maybe it is the approximation strategy of using supervised learning to estimate the density ratio that makes the generated samples so realistic. (See the video at 59:15)
GANs often choose to generate from very few modes; fewer than the limitation imposed by the model capacity. The reverse KL prefers to generate from as many modes of the data distribution as the model is able to; it does not prefer fewer modes in general. This suggests that the mode collapse is driven by a factor other than the choice of divergence.
Comparison to MLE and NCE: See #25.
Training tricks:
- Virtual batch norm > batch norm (avoids generating highly correlated samples within a batch)
- See more on "How to Train a GAN? Tips and tricks to make GANs work" by Chintala et al.
Mode collapse is believed not to be caused by minimizing the reverse KL, since mode collapse still happens when minimizing the forward KL. A deficiency in the design of the minimax game could be a cause of mode collapse. See the paper "Unrolled Generative Adversarial Networks", which successfully generates different modes of the data.
Model architectures that cannot capture global structure will generate images with wrong global structure.
See "A note on the evaluation of generative models" for a good overview of evaluating GANs.
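
The asymmetry of the KL divergence mentioned in the notes above can be seen on a small discrete example. The three distributions are made up for illustration: `p_data` has two modes, `q_one_mode` covers only one of them, and `q_spread` spreads its mass uniformly. This only illustrates that the two directions rank candidate models differently within this toy model class; as the notes point out, it does not by itself explain mode collapse in GANs.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.49, 0.02, 0.49]        # two modes separated by a low-density region
q_one_mode = [0.96, 0.02, 0.02]    # candidate model covering a single mode
q_spread = [1 / 3, 1 / 3, 1 / 3]   # candidate model spreading mass everywhere

forward = {"one_mode": kl(p_data, q_one_mode), "spread": kl(p_data, q_spread)}
reverse = {"one_mode": kl(q_one_mode, p_data), "spread": kl(q_spread, p_data)}
```

The forward KL (the MLE direction) ranks the spread-out model as closer to the data, while the reverse KL ranks the single-mode model as closer.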

Generative Adversarial Networks (GANs): What it can generate and What it cannot? (Arxiv 2018)

This paper summarizes many GANs papers for addressing different challenges. Nice summary!

Machine Learning that Matters

Metadata

Author: Kiri L. Wagstaff
Organization: Jet Propulsion Laboratory, California Institute of Technology
Conference: ICML 2012
Paper: https://arxiv.org/pdf/1206.4656.pdf

Unsupervised Pretraining for Sequence to Sequence Learning

Metadata

Authors: Prajit Ramachandran, Peter J. Liu and Quoc V. Le
Organization: Google Brain
Conference: EMNLP 2017
Link: https://goo.gl/n2cKG9

Regularizing and Optimizing LSTM Language Models

Metadata

Authors: Stephen Merity, Nitish Shirish Keskar and Richard Socher
Release Date: 2017 on Arxiv
Link: https://arxiv.org/pdf/1708.02182.pdf

Vision with Referring Expressions

Vision with Referring Expressions (Last Update Date: 2019/03/06)

A curated list of deep learning papers on computer vision with referring natural language. This line of research is also related to image captioning, visual question answering, multimodal grounding for language and multimodal machine learning.

Survey

From Image to Language and Back Again by Belz et al. Natural Language Engineering 2018.

Understanding disentangling in β-VAE

Metadata

Authors: Christopher P. Burgess, Irina Higgins, +4 authors Alexander Lerchner
Organization: DeepMind
Publish Date: 2018.04
Paper: https://arxiv.org/pdf/1804.03599.pdf
3rd-party code: https://github.com/1Konny/Beta-VAE

Useful Tutorials of VAE and β-VAE

Read "From Autoencoder to Beta-VAE" or "What a Disentangled Net We Weave: Representation Learning in VAEs" to understand the intuition behind them.
Read "Variational Coin Toss" to understand the intuition of variational inference (the basis of VAE).
Read the variational inference notes in Stanford CS228 - Probabilistic Graphical Models, or refer to "A Tutorial on Variational Bayesian Inference" for more mathematical details.
See also the original VAE paper and the "Notes on Variational Autoencoders".
This paper is a follow-up work of the original β-VAE paper.

Background

β-VAE is a state-of-the-art model for unsupervised visual disentangled representation learning.
β-VAE adds an extra hyperparameter β to the VAE objective, which constricts the effective encoding capacity of the latent bottleneck and encourages the latent representation to be more factorized.
The disentangled representations learned by β-VAE have been shown to be important for learning a hierarchy of abstract visual concepts conducive to imagination (SCAN, Higgins et al.) and for improving transfer performance of reinforcement learning policies, including simulation to reality transfer in robotics (DARLA, Higgins et al.).
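
For reference, the β-VAE objective described above can be written as follows, where q_φ(z|x) is the encoder, p_θ(x|z) the decoder and p(z) the prior (typically an isotropic unit Gaussian). The symbols θ and φ are notation introduced here for the decoder and encoder parameters.

```latex
\mathcal{L}(\theta, \phi; x, z, \beta)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \beta \, D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

Setting β = 1 recovers the standard VAE evidence lower bound; β > 1 puts more weight on the KL term and so tightens the latent bottleneck.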

Motivation

It is currently unknown what causes the factorized representations learnt by β-VAE to be axis aligned with the human intuition of the data generative factors compared to the standard VAE.
Furthermore, β-VAE has other limitations, such as worse reconstruction fidelity compared to the standard VAE. This is caused by a trade-off introduced by the modified training objective that punishes reconstruction quality in order to encourage disentanglement within the latent representations.
This paper attempts to shed light on the question of why β-VAE disentangles, and to use the new insights to suggest practical improvements to the β-VAE framework to overcome the reconstruction-disentanglement trade-off.

Understanding disentangling in β-VAE

From information bottleneck principle (Tishby et al. 1999) perspective, the β-VAE training objective encourages the latent distribution q(z|x) to efficiently transmit information about the data points x by jointly minimizing the β-weighted KL term and maximizing the data log likelihood.
A strong pressure for overlapping posteriors encourages β-VAE to find a representation space preserving as much as possible the locality of points on the data manifold.
Hypothesis: β-VAE finds latent components which make different contributions to the log-likelihood term of the objective function. These latent components tend to correspond to features in the data that are intuitively qualitatively different, and therefore may align with the generative factors in the data.
For example, consider optimizing the β-VAE objective under an almost complete information bottleneck constraint (i.e. β >> 1). The optimal thing to do in this scenario is to only encode information about the data points which can yield the most significant improvement in data log-likelihood (i.e. E_{q(z|x)}[log p(x|z)]).

Intuition of Improvement (The most important part)

For example, in the dSprites dataset (consisting of white 2D sprites varying in position, rotation, scale and shape rendered onto a black background) the model might only encode the sprite position under such a constraint. Intuitively, when optimizing a pixel-wise decoder log likelihood, information about position will result in the most gains compared to information about any of the other factors of variation in the data, since the likelihood will vanish if reconstructed position is off by just a few pixels.
Continuing this intuitive picture, we can imagine that if the capacity of the information bottleneck were gradually increased, the model would continue to utilize those extra bits for an increasingly precise encoding of position, until some point of diminishing returns is reached for position information, where a larger improvement can be obtained by encoding and reconstructing another factor of variation in the dataset, such as sprite scale.
They further test this intuition by training a model to generate dSprites conditioned on ground truth factors, with a controllable information bottleneck. Each factor is independently scaled by a learnable parameter and is subject to independently scaled additive noise (also learned), similar to the reparameterized latent distribution in β-VAE. Throughout training, the capacity of the information bottleneck increases linearly. The experiment shows that the early capacity is allocated to positional latents only (x and y), followed by a scale latent, then shape and orientation latents.

Reference

SCAN: Learning Hierarchical Compositional Visual Concepts by Irina Higgins et al. ICLR 2018.
DARLA: Improving Zero-Shot Transfer in Reinforcement Learning by Irina Higgins et al. ICML 2017.

Unsupervised Machine Translation using Monolingual Corpora Only

Metadata

Authors: Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato
Organization: Facebook AI Research
Conference: ICLR 2018
Link: https://openreview.net/forum?id=rkYTTf-AZ

A Study of Reinforcement Learning for Neural Machine Translation

Metadata

Authors: Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai and Tie-Yan Liu.
Organization: MSRA
Conference: EMNLP 2018
Paper: https://arxiv.org/pdf/1808.08866.pdf

Visual Question Answering: Datasets, Algorithms, and Future Challenges

Metadata

Authors: Kushal Kafle and Christopher Kanan
Organization: Chester F. Carlson Center for Imaging Science Rochester Institute of Technology
Paper: https://arxiv.org/pdf/1610.01465.pdf
Journal: Computer Vision and Image Understanding 2017

Reaching Human-Level Performance in Automatic Grammatical Error Correction: An Empirical Study

Metadata

Authors: Tao Ge, Furu Wei, Ming Zhou
Organization: MSRA
Conference: ACL 2018
Original Paper: http://aclweb.org/anthology/P18-1097 (presents a detailed comparison and analysis of different fluency boost learning and inference methods, which isn't summarized here.)
Follow-up Paper: https://arxiv.org/pdf/1807.01270.pdf
Video: https://vimeo.com/285802209

Deep Contextualized Word Representations

Metadata

Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee and Luke Zettlemoyer
Organization: Allen Institute for Artificial Intelligence
Conference: NAACL 2018 Best Paper
Link: https://arxiv.org/pdf/1802.05365.pdf

When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

Metadata

Authors: Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Janani Padmanabhan and Graham Neubig
Organization: Language Technologies Institute, Carnegie Mellon University
Release Date: 2018 on Arxiv
Link: https://arxiv.org/pdf/1804.06323.pdf

A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction

Metadata

Authors: Shamil Chollampatt and Hwee Tou Ng
Organization: National University of Singapore
Release Date: 2018 on Arxiv
Link: https://arxiv.org/pdf/1801.08831.pdf

Surveys for Deep Visual Domain Adaptation

Deep Visual Domain Adaptation: A Survey

Authors: Mei Wang & Weihong Deng
Date: 2018.02
Paper: https://arxiv.org/pdf/1802.03601.pdf

Domain Adaptation for Visual Applications: A Comprehensive Survey

Author: Gabriela Csurka
Date: 2017.02
Paper: https://arxiv.org/pdf/1702.05374.pdf

DeepCO3: Deep Instance Co-segmentation by Co-peak Search and Co-saliency Detection

Metadata

Authors: Kuang-Jui Hsu, Yen-Yu Lin, Yung-Yu Chuang
Organization: Academia Sinica & NTU
Conference: CVPR 2019
Paper: http://cvlab.citi.sinica.edu.tw/images/paper/cvpr-hsu19.pdf
Code: https://github.com/KuangJuiHsu/DeepCO3

On Distinguishability Criteria for Estimating Generative Models

Metadata

Author: Ian J. Goodfellow
Organization: Google
Conference: ICLR 2015
Paper: https://arxiv.org/pdf/1412.6515.pdf

Exploring the Limits of Language Modeling

Metadata

Authors: Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer and Yonghui Wu
Organization: Google Brain
Release Date: 2016 on Arxiv
Paper: https://arxiv.org/pdf/1602.02410.pdf
Code: https://github.com/tensorflow/models/tree/master/research/lm_1b

Unsupervised Text Style Transfer using Language Models as Discriminators

Metadata

Authors: Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, Taylor Berg-Kirkpatrick
Organization: CMU and DeepMind
Conference: NIPS 2018
Paper: https://arxiv.org/pdf/1805.11749.pdf
Publish Date: 2018.05

Auxiliary Objectives for Neural Error Detection Models

Metadata

Authors: Marek Rei and Helen Yannakoudakis
Organization: University of Cambridge
Conference: BEA@EMNLP 2017
Link: http://www.aclweb.org/anthology/W17-5004

Evaluating ‘Graphical Perception’ with CNNs

Metadata

Authors: Daniel Haehn, James Tompkin, and Hanspeter Pfister
Organization: Harvard (Visual Computing Group) & Brown
Conference: IEEE Transactions on Visualization and Computer Graphics (IEEE VIS), 2018
Paper: https://danielhaehn.com/papers/haehn2018evaluating.pdf
Supplemental Material: https://danielhaehn.com/papers/haehn2018evaluating_supplemental.pdf
Video: https://vimeo.com/280506639
Code: https://github.com/rhoana/perception#readme
Poster: https://danielhaehn.com/papers/haehn2018evaluating_poster.pdf

A Simple Neural Network Module For Relational Reasoning

Metadata

Authors: Adam Santoro, David Raposo, (+4 authors), Timothy P. Lillicrap
Organization: DeepMind
Conference: NIPS 2017
Publish Date: 2017.06

Neural Arithmetic Logic Units

Metadata

Authors: Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, Phil Blunsom
Organization: DeepMind
Conference: NIPS 2018
Paper: https://arxiv.org/pdf/1808.00508.pdf
Code: https://github.com/iamtrask/NALU-2

Dataset Biases in Machine Learning

Dataset Biases in Machine Learning (Last Update Date: 2019/03/03)

Image Classification

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints by Zhao et al. 2017/07. EMNLP 2017.
Turning a Blind Eye: Explicit Removal of Biases and Variation from Deep Neural Network Embeddings by Alvi et al. 2018/09. ECCV 2018.
Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet by Brendel and Bethge. 2018/09. ICLR 2019.
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness by Geirhos et al. 2018/11. ICLR 2019.
Learning Not to Learn: Training Deep Neural Networks with Biased Data by Kim et al. 2018/12.

Image Captioning

Exploring Nearest Neighbor Approaches for Image Captioning by Devlin et al. 2015/05.
Women also Snowboard: Overcoming Bias in Captioning Models by Burns et al. 2018/03. ECCV 2018.

Visual Question Answering

Simple baseline for visual question answering by Zhou et al. 2015/12.
Analyzing the behavior of visual question answering models by Agrawal et al. 2016/06. EMNLP 2016.
Revisiting visual question answering baselines by Jabri et al. 2016/06. ECCV 2016.
Making the v in vqa matter: Elevating the role of image understanding in visual question answering by Goyal et al. 2016/12. CVPR 2017.
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering by Agrawal et al. 2017/12. CVPR 2018.
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization by Ramakrishnan et al. 2018/10. NeurIPS 2018.
Explicit Bias Discovery in Visual Question Answering Models by Manjunatha et al. 2018/11.

Referring Expression Comprehension

Visual Referring Expression Recognition: What Do Systems Actually Learn? by Cirik et al. 2018/05. NAACL 2018.

Natural Language Processing

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings by Bolukbasi et al. 2016/07. NIPS 2016.
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods by Zhao et al. 2018/04. NAACL 2018.
Adversarial Removal of Demographic Attributes from Text Data by Yanai Elazar and Yoav Goldberg. 2018/08. EMNLP 2018.
Learning Gender-Neutral Word Embeddings by Zhao et al. 2018/09. EMNLP 2018.
Understanding the Origins of Bias in Word Embeddings by Brunet et al. 2018/10.

Machine Unlearning

Towards Making Systems Forget with Machine Unlearning by Cao et al. S&P 2015.
Efficient Repair of Polluted Machine Learning Systems via Causal Unlearning by Cao et al. AsiaCCS 2018.
Learning to Unlearn: Building Immunity to Dataset Bias in Medical Imaging Studies by Ashraf et al. 2018/12. NeurIPS 2018.

Machine Learning Fairness

CS 294: Fairness in Machine Learning from UC Berkeley, Fall 2017.
Machine Learning Fairness from Google.

Compositional Attention Networks for Machine Reasoning

Metadata

Authors: Drew A. Hudson and Christopher D. Manning
Organization: Stanford University
Conference: ICLR 2018
Paper: https://arxiv.org/pdf/1803.03067.pdf
Code: https://github.com/stanfordnlp/mac-network
Video: https://www.youtube.com/watch?v=jpNLp9SnTF8

Linguistic Input Features Improve Neural Machine Translation

Metadata

Authors: Rico Sennrich and Barry Haddow
Organization: School of Informatics, University of Edinburgh
Conference: WMT 2016
Link: https://goo.gl/jqYQ8r

