kweonwooj / papers Goto Github PK

View Code? Open in Web Editor NEW

318.0 318.0 34.0 2 KB

summary of ML papers I've read

papers's People

Contributors

Stargazers

Watchers

papers's Issues

Deep Voice 3: 2000-Speaker Neural Text-to-Speech

Abstract

Propose Deep Voice 3, fully CNN + Attention-based Neural TTS system
Trains x10 faster, trained on 800+ hours with 2,000+ speakers
Contribution of kernel implementation to speed up inference

Details

Intro
- Main idea borrowed from NMT, CNN-based auto-regressive model
- Attention is monotonic and should be local in TTS
- TTS models can output several waveforms, later convert to speech via WORLD, Griffin-Lim and WaveNet
Model Architecture
- Input : Text Features (characters, phonemes, stresses)
- Output : Acoustic Features (mel-band spectrogram, linear-scale log magnitude spectrogram, vocoder features)
- Encoder
  - Pre-net + ConvNet x N + Post-net
  - Adds speaker embedding everywhere
- Attention Block
  - Speaker embedding is added to positional embedding
- Converter
  - Different loss function for different output
Text Preprocessing
- uppercase all chars, remove intermediate punctuation, end with punctuation
Result
- Higher MOS with x10 faster training

Personal Thoughts

Good model, with open implementation (https://github.com/Kyubyong/deepvoice3)
Want to train end-to-end once

Link : https://arxiv.org/pdf/1710.07654.pdf
Authors : Ping et al. 2017

100-epoch ImageNet Training with AlexNet in 24 Minutes

Abstract

Large batch (32K) training with LARS algorithm enables 100-epoch ImageNet training with AlexNet in 24 minutes.
One hour for 90-epoch ResNet-50 with 512 Intel KNLs
Usually, 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days

Details

Data parallelism is dominant in large-scale training due to its stability
Large batch size means less updates in a fixed epoch.
- existing solutions such as Linear Scaling (Krizhevsky 2014) and Warmup Scheme (Goyal et al. 2017) were not effective
- in this paper, LARS algorithm (You et al. 2017)with warm-up scheme achieves same test accuracy level with shorter training time.

Personal Thoughts

Hardware is the key component of large parallel training
interested in LARS method

Link : https://arxiv.org/pdf/1709.05011.pdf
Authors : You et al. 2017

Can Active Memory Replace Attention?

Abstract

Yes for case of soft attention : somewhat mixed result across tasks.
Active memory operate on all of memory in parallel in a uniform way, bringing improvement in algorithmic task, image processing and generative modellings.
Deos active memory perform well in machine translation? [YES]

Details

[Attention]
- Only a small part of the memory changes at every step, or the memory remains constant.
- Important limitation in attention mechanism is that it can only focus on a single element of the memory due to its nautre of softmax.
[Active Memory]
- Any model where every part of the memory undergoes active change at every step.
[NMT with Neural GPU]
- parallel encoding and decoding
- BLEU < 5
- conditional dependence between outputs are not considered
[NMT with Markovian Neural GPU]
- parallel encoding and 1-step conditioned decoding
- BLEU < 5
- possibly, Markovian dependence of the outputs is too weak for this problem - a full recurrent dependence of the state is needed for good performance
[NMT with Extended Neural GPU]
- parallel encoding and sequential decoding
- BLEU = 29.6 (WMT 14 En-Fr)
- active memory decoder (d) holds recurrent state of decoding and output tape tensor (p) holds past decoded logits, going through CGRU^d.
[CGRU]
- convolutional operation followed by recurrent operation
- stack of CGRU expands receptvie field of conv op
- output tape tensor acts as external memory of decoded logits

Personal Thoughts

Same architecture, but encoder and decoder hidden states may be doing different things
- encoder : embed semantic locally
- decoder : track how much it has decoded, use tape tensor to hold information of what it has decoded
Will it work for languages with different sentence order?
What part of the translation problem can we treat as convolutional?
Is Transformer a combination of attention and active memory?

Link : https://arxiv.org/pdf/1610.08613.pdf
Authors : Lukas Kaiser (Google Brain) et al. 2017

Single-Queue Decoding for Neural Machine Translation

Abstract

Propose more flexible decoding algorithm which can revisit discarded hypotheses in a later step, Single Queue Decoding
Design a penalty function to punish the hypotheses that tend to produce longer or shorter than expected

Details

Beam Search has disadvantages in that
- algorithm must give up some of existing hypotheses due to fixed beam size ~ exploration-exploitation dilemma
Single Queue Decoding
- Save all hypotheses in the single queue
- Extract arbitrary hypotheses with different length given the score function
- Score Function
Results
- Higher BLEU, relatively low speed delay
- SQD shows higher NLL value than beam search with same beam size

Personal Thoughts

Improvement in NLL is not significant
Searches over more spaces, hence longer inference time
Not sure this is a good decoding method

Link : https://arxiv.org/pdf/1707.01830.pdf
Authors : Shu et al. 2017

Dual Supervised Learning

Abstract

Utilize dual tasks that have intrinsic connections with each other due to the probabilistic correlation (En-Fr vs Fr-En translation, Speech Recognition vs Text to Speech, Image Classification vs Image Generation)
Propose dual supervised learning method that trains dual tasks simultaneously.
Improves performance of both tasks

Details

Conditional distributions of the primal and dual tasks should ssatisfy the following equality :

Add probabilistic duality term in loss function as specified as below :

lambda_xy are hyperparameters and best performance is obtained with lambda near ~0.01. It shows that the effect of probabilistic duality is quite small.

Personal Thoughts

Utilizing duality of the tasks is clever and practical in theory, it will lead to more data.
The improvement, however, seems limited.

Link : https://arxiv.org/pdf/1707.00415.pdf
Authors : Yingce Xia (School of Information Science and Technology, University
of Science and Technology of China, Hefei, Anhui, China) et al. 2017

Attention and Augmented Recurrent Neural Networks

Abstract

Augmenting RNN with Attention is a new trend.
A human with a piece of paper is, in some sense, much smarter than a human without.
Since vectors are the natural language of neural networks, the memory is an array of vectors

Details

Neural Turing Machine
- RNN with external memory bank
- reading and writing : instead of predicting where to read/write (discrete), model always read/writes on all area but simply learn the weight
Attentional Interfaces
- Basic attention
Adaptive Computation Time
- a way for RNNs to do different amounts of computation each step
Neural Programmers
- learns to create programs in order to solve a task

Personal Thoughts

attention is the key to next generation neural network

Link : https://distill.pub/2016/augmented-rnns/
Authors : Olah et al. 2016

Key-Value Memory Networks for Directly Reading Documents

Abstract

Introduce Key-Value Memory Network that makes reading documents more viable by utilzing different encodings in the addressing and output stages of the memory read operation
Achieves SOTA on existing WikiQA benchmark

Details

QA task has been directed toward using Knowledge Bases (KBs), which has proven effective, but it suffers from being too restrictive, as the schema cannot support certain types of answers, and too sparse.
Key-Value Memory Network is an extension of Memory Network.
- Knowledge source is cumulatively added to context
- question is embedded as query, which does inner product with key(context), and the softmax output is weighted on values(content).
- In KVMemNet, memory slots are pairs of vectors

Personal Thoughts

QA task actively uses attention to achieve better scores, but are not fully applicable/related to NMT

Link : https://arxiv.org/pdf/1606.03126.pdf
Authors : Miller et al. 2016

Refining Source Representations with Relation Networks for Neural Machine Translation

Abstract

Relation Network + NMT to refine the encoding representation of the source
Claims old information is often forgotten, and words in front are not processed with respect to words in latter.

Details

Relation Network learns the relationship between source words
- Q: how is it different from self-attention?
RN is composed of CNN + Graph Propagation + MLP Layer with LeakyRelus
In EN-CN, outperforms Transformer (Transformer is on par with NMT+ in En-Cn pair)
- seems they used word vocab, no bpe
Strong performance in long sentences (length 50+)
Visualization
- word alignment table
- actual source/ref/hyp results

Personal Thoughts

Enriching information passed to attention layer
Improvement in Decoding side should also be investigated
Re-read Relation Network and Related Works

Link : https://arxiv.org/pdf/1709.03980.pdf
Authors : Zhang et al. 2017

Unsupervised Neural Machine Translation

Abstract

NMT in completely unsupervised manner, relying on nothing but monolingual corpora.
Earlier works include triangulation and semi-supervised learning which still requires a strong cross-lingual signal
Shared Encoder + Denoising Autoencoder + Language specific Decoder

Details

Motivation
- A parallel corpora with good quality is difficult/expensive to acquire, whereas monolingual corpus is relatively easier.
- Low-resource languages or unique language pairs cannot afford to have a quality-parallel-corpora sufficient to train NMT model
Unsupervised NMT
- fixed cross-lingual embedding is obtained via word2vec
- Shared encoder encodes the meaning of the sentence, noise is included for robust learning
- L1 decoder (De-noising autoencoder) attempts to re-produce the input, trying to learn the latent variables of inputs
- L2 decoder (Language-specific decoder)
  - given En input to shared encoder, L2 decoder for English can be learned.
- Training takes place by alternating training objective L1/L2 from batch to batch.
Result
- Simple de-noising makes model learn to copy, instead of translate
- adding back-translation (L2 decoder) significantly improves the performance
- BPE helps to tackle unknown words, although still weak in named entities
- With small amount of parallel corpus (semi-supervised), performance can be improved further
Analysis
- Quantitatively, BLEU is still much lower (about BLEU 10 points) than SOTA supervised models, but translation in unsupervised manner does work.
- Qualitatively, UNMT proves it goes beyond a literal word-by-word substitution and correctly translates structural differences between languages, a good sign

Personal Thoughts

my idea on VAE + GAN model is very similar to unsupervised NMT
Interested in seeing the shared encoder work
Papago needs them, we can extend the language with abundant monolingual data

Link : https://arxiv.org/pdf/1710.11041.pdf
Authors : Artetxe et al. 2017

SYSTRAN’s Pure Neural Machine Translation Systems

Abstract

Comprehensive Technical Overview and Empirical Results of NMT in Systran
12 languages, for 32 language pairs

Details

[Corpus] Utilize 3 corpora for each language pair
- a baseline corpus (1 million sentences) for quick experiments (day-scale)
- a medium corpus (2-5M) for real-scale system (week-scale)
- a very large corpora with more than 10M segments
[Train Epoch]
- In Junczys-Dowmunt et al. 2016, authors mention using corpus of 5M sentences and training of 1.2M batches each having 40 sentences – meaning basically that each sentence of the full corpus is presented 10 times to the training.
- In Wu et al. 2016, authors mention 2M steps of 128 examples for English–French, for a corpus of 36M sentences, meaning about 7 iterations on the complete corpus.
- In our framework, for this release, we systematically extended the training up to 18 epochs and for some languages up to 22 epochs.
[PlaceHolder]
- In most language pairs, our strategy combines a vocabulary shortlist and a placeholder mechanism
- named entity placeholders (number, name, currency, url etc)
[Vocab]
- For enko and jaen, used BPE to reduce the vocabulary size but also to deal with rich morphology and spacing ﬂexibility that can be observed in Korean.
[Improvements of NMT]
- Word Features by Sennrich and Haddow et al. 2016
- Guided alignment strategy described in Chen et al. 2016
- Politeness feature by Sennrich et al. 2016a
[Pre-training]
- using pre-trained model learned with generic corpus, and re-learning with domain-specific corpus enhances domain adaptation with fast speed
[Efficient Model]
- 60% of parameters pruned w/o hurting the performance, by See et al. 2016
- Distillation by Kim and Rush, 2016
  - slightly higher accuracy results for a 70% reduction of the number of parameters and a 30% increase on decoding speed
[Problems to be solved]
- Missing words or parts of sentence
- Badly managing quotes
- Adequacy << Fluency in NMT
- Very long sentences
- Short word or the title of a news article
- Cleaning the corpus
- Alignment

Personal Thoughts

Very good paper, comprehensive and empirical.
Systran has good technical culture
Papago team must also be experimenting lots of ideas
Let's implement lots of ideas from academia!!

Link : https://arxiv.org/pdf/1610.05540.pdf
Authors : Crego et al. 2016

Compression of Neural Machine Translation Models via Pruning

Abstract

Magnitude-based pruning in NMT (RNN)
LSTM/Attention based NMT with 200M params can be pruned by 40% with very little performance loss on WMT14 En-De
With retraining, model can surpass original performance with 80% pruned model
Extending the magnitude-based pruning approach of Han et al 2015 for CNN into RNN

Details

Outline the NMT parameter architecture as below
[Pruning Schemes]
- Class-blind : prune x% with smallest magnitude, regardless of weight class
- Class-uniform : prune x% with smallest magnitude within each class
- Class-distribution : prune weights with magnitude less than standard deviation of each class
Class-blind scheme outperforms
Important Parameters
- Higher layers are more important than lower layers (this is opposite from CNN's phenomenon)
- Attention and Softmax are crucial
- FC params are important in lower vocab to embedding layer
- Forget gate params are less important in lower layers
- Diagonal values in recurrent params seem important
- least common word embeddings are less important
Retraining
- when re-training, performance enhances due to regularization effect
- when started from sparse format, achieves slightly lower performance
Generalizability
- Pruning scheme shows similar results and phenomenon in smaller NMT model with different language pair
Future Work
- This paper does single-step pruning and retraining. Multiple iterations may improve the performance.
- Other pruning methods (Optimal Brain Damage and Optimal Brain Surgery) are not empirically investigated

Personal Thoughts

Great experimentation. Clever pruning schemes that shows exactly which params are important
Applicable to Papago NMT system

Link : https://arxiv.org/pdf/1606.09274.pdf
Authors : See et al. 2016

Toward a full-scale neural machine translation in production: the Booking.com use case

Abstract

Empirical results on training NMT in large scale E-commerce setting by Booking.com
Covers optimization, training and evaluation

Details

Model Architecture
- 4-layer LTSM written in Lua
- Use global attention
- Use "case" embedding feature
- 0.3 residual
- no batch size indicated
- Handles named entity by pre-processing the input, detecting NE-tag in both sentences and replacing it with placeholder and simply copying it via attention map
Optimizer
- 1M En-De dataset
- SGD vs Adam vs Adagrad vs Adadelta (1.0, 0.0002, 0.1, 1.0)
- SGD performs best
Multi-GPU
- Async vs Sync Multi-GPU
- single GPU performs best ~ opposite of our in-house result
Corpus Size
- 1M, 2.5M, 5M, 7.5M, 10M corpus ran 90M iterations
- 10M performs best after-all, with higher human eval which is latent in BLEU score (more data, the better it is)
Evaluation
- Adequacy + Fluency metric

Personal Thoughts

Solid works and experiments on NMT
In-house data seems to be abundant and strong
good to see that they openly publish their results

Link : https://arxiv.org/pdf/1709.05820.pdf
Authors : Levin et al. 2017

Unsupervised Machine Translation using Monolingual Corpora Only

Abstract

Fully Unsupervised NMT using Monolingual Corpora only by FAIR
De-noising Auto-encoder + Language specific Decoder + Language Discriminator
Good paper on ICLR 2018
Enables better NMT for low-resource language pairs
Performance is still blocks below supervised NMT

Details

Key Idea
- Build a common latent space between the two languages
- Learn to translate by reconstructing in both domains according to two principles
- (i) the model has to be able to reconstruct a sentence in a given language from a noisy version of it, as in standard de-noising auto-encoders
- (ii) The model also learns to reconstruct any source sentence given a noisy translation of the same sentence in the target domain, and vice versa
Learning Objective
- De-noising Auto-Encoder : Embed sentence into latent space with noise and reconstruct it back

- Cross-Domain : Minimizing loss for (Source in lang1 -> Latent Space -> Reconstructed Target in lang2 -> Back into Latent Space -> Reconstruct Source in lang1 )

- Adversarial : Discriminator tries to identify the language by seeing the embedding in latent space, the model tries to fool by mapping same semantic sentences into same latent space in language independent manner

- Final Objective Function

Training
- Model starts with an unsupervised word-by-word translation in an unsupervised way
- Encoder tries to map the source sentence with noise into shared latent space, and reconstruct as in de-noising auto-encoder.
- Decoder learns to reconstruct the input from the latent space, given a language flag
- Discriminator tries to identify the source language in an adversarial setting

Model Selection
- BLEU score for two-way translation is used as a evaluation metric
- shows good correlation with classic BLEU

Results
- Not sure the baselines were really meaningful
- Unsupervised does learn something!

Monolingual vs Parallel Corpus
- 10M Monolingual ~ 100K Parallel Corpora

Ablation Study
- dropping subset of training scheme to see which part is critical in learning
- De-noising Auto-Encoder and Cross-Domain are both critical

Personal Thoughts

Great work of Unsupervised NMT
Better than Cho's paper because it is fully differentiable
Interesting to see the shared embedding concept!

Link : https://arxiv.org/pdf/1711.00043.pdf
Authors : Lample et al. 2017

Learning to Compute Word Embeddings On the Fly

Abstract

Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare (long tail)
Representations of rare words(OOV) is difficult to train, often too data-hungry
Propose a method for predicting embeddings of rare words on the fly with small amount of auxiliary data

Details

Definition Embedding
- Instead of using simple OOV embedding, use definition encoder that takes the dictionary definition of rare word and inject in the original encoder
Experiments
- baseline : simple RNN structure with OOV
- GloVe : unsupervised GloVe pretrained embedding with 840M monolingual corpus
- with Dictionary or Spelling
- Still margin weaker than GloVe, but given the amount of corpus/pre-training necessary, it is a good step forward

Personal Thoughts

How is actual implementation done?
- use same encoder while training?
- use already pre-trained encoder for definition encoder?
How is the performance in NMT?

Link : https://arxiv.org/pdf/1706.00286.pdf
Authors : Bahdanau et al. 2017

Deeply Supervised Nets

Abstract

Introduce Companion Objective to the individual hidden layers, in addition to the overall objective of output layer.
Directly pursue feature discriminativeness at all hidden layers to enhance network performance.

Details

companion objective is added to overall objective with hyperparameter gamma and decay rate alpha.

Personal Thoughts

I thought making hidden layers discriminative will greatly enhance performance, but the result is not as competitive as expected.
Optimizing the intermediate layers is an interesting area to investigate.

Link : https://arxiv.org/pdf/1409.5185.pdf
Authors : ChenYu Lee (UCSD) et al. 2014

StarSpace: Embed All The Things!

Abstract

a general-purpose neural embedding model
FAIR's SOTA embedding model

Details

StarSpace treats all features as embedding, a set of features (entity) as bag-of-features (also embedding) and optimizes w.r.t. to the similarity of label which is also an embedding.
Use of positive generator and negative generator stabilizes the learning mechanism.
widely applicable and shows strong performance in text classification, embedding etc

Personal Thoughts

can better embedding help performance of translation?

Link : https://arxiv.org/pdf/1709.03856.pdf
Authors : Wu et al. 2017

Weighted Transformer Network for Machine Translation

Abstract

Propose Weighted Transformer, a Transformer with modified attention layers, that performs better in BLEU score and converges 15~40% faster
Specifically, replace the multi-head attention by multiple self-attention branches

Details

In short, Transformer is Self-Attention + Positional Encoding + Multi-head Attention
Multi-head Attention + FFN
- generate multiple heads from same input QKV, concatenate them after single linear transformation

- After each layer, two-layered FFN is used

Proposed Multi-branch Attention
- κ can be interpreted as a learned concatenation weight and α as the learned
  addition weight
- κ scales the contribution of the various branches before α is used to sum
  them in a weighted fashion.
- Unlike the Transformer, which weighs all heads equally, the proposed mechanism allows for ascribing importance to different heads
- it only adds 192 parameter to existing 213M params in big model

Weighted Transformer in Graph

Training Details
- label smoothing with 0.1, dropout with 0.1
- Adam optimizer with same setting as Transformer, but larger learning rate for (κ,α) and freeze weight for (κ,α) for last 10K iterations
- 25,000 token per batch
- Training Time
  - Transformer small : big = 100K : 300K vs wTransformer small : big = 60K : 250K
Result

Parameter Search Result

Regularization Effect shown via viz

(κ,α) Trend during training
- each branch gets different amount of weight
- last 10K fixed

Gating
- replace the summation in Equation (7) by a gating structure that sums up the contributions of the top k branches with the highest probabilities
- not sure how it is done.

Personal Thoughts

ICLR 2018 submission
- ICLR review is good learning material.
Improvement in Performance is not of great margin
Adding new structure to existing SOTA, but without strong theoretical background, less novelty
Worth trying, when we already have good Transformer implementation

Link : https://openreview.net/forum?id=SkYMnLxRW&noteId=SkYMnLxRW
Authors : Anonymous

Trainable Greedy Decoding for Neural Machine Translation

Abstract

Decoder in NMT is simply rule-based, one can improve its performance by learning to decode
Trainable greedy decoder learns to manipulate the hidden state of a trained neural translation system with an arbitrary decoding objective (BLEU or perplexity)
trained by a novel variant of deterministic policy gradient, called critic-aware actor learning.

Details

Much of the research on neural machine translation has focused solely on improving the model architecture, not on decoding
Cho (2016) showed the limitation of greedy decoding by simply injecting unstructured noise into the hidden state of the neural machine translation system
Uses Deterministic Policy Gradient with Critic-Aware Actor Learning for stable learning algorithm

Personal Thoughts

must read Cho's paper on NPAD
must learn about reinforcement learning (actor-critic, policy gradient etc)
- mathematical formula for loss function of actor-critic model and model figures (Fig.1) are difficult to understand

Link : https://arxiv.org/pdf/1702.02429.pdf
Authors : Gu et al. 2017

Tacotron: Toward End-to-End Speech Synthesis

Abstract

Propose Tacotron, an End-to-End Text-to-Speech Seq-to-Seq Model with Attention with <text, audio> data
Frame level generation, much faster than sample level auto-regressive model

Details

Modern TTS models are complex and modular
- classic : text extraction, feature extraction, acoustic model, and vocoder
- Wavenet : slow due to its sample-level autoregressive nature, also requires conditioning on linguistic features from an existing TTS frontend, hence not end-to-end
- Wang's Seq2Seq model : requires pre-trained hidden Markov model aligner
Model Architecture
- Input : Character level text
- Encoder : Pre-net > CBHG
- Decoder : Pre-net > CBHG with frame_size=3, outputs spectrogram
- Griffin-Lim reconstruction : outputs waveform

CBHG
- Lee et al. 2017
- add non-causal convolutions, batch normalization, residual connections, and stride=1 max pooling from original architecture for better regularization

Griffin-Lim (Post-processing)
- Griffin-Lim algorithm (Griffin & Lim, 1984) synthesizes waveform from the predicted spectrogram
- used Tensorflow implementation
- it helps to improve performance by polishing the output via context data

Hyper-parameters

Data
- Internal North American English dataset with 24.6 hours
Result
- 5-scale mean opinion score evaluation

Personal Thoughts

Is the data enough?
Is there no automatic evaluation metric?

Link : https://arxiv.org/pdf/1703.10135.pdf
Authors : Wang et al. 2017

Adaptive Computation Time for Recurrent Neural Networks

Abstract

Introduces Adaptive Computation Time (ACT), an algorithm that allows RNN to learn how many computational steps to take between receiving an input and emitting an output
Experimental results on four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers show that performance is dramatically improved by the use of ACT.
In character-level language modelling on the Hutter prize Wikipedia, ACT does not yield large gains in performance but it provides insight into the structure of the data, with more computation allocated to harder-to-predict transitions such as spaces between words and ends of sentences.

Details

The approach pursued here is to augment the network output with a sigmoidal halting unit
whose activation determines the probability that computation should continue.
RNN vs RNN with ACT
- In short, it uses intermediate states which are activated by sigmoidal halting unit dynamically.
- Timing penalty is applied to minimize the amount of pondering when not necessary.
- Exact formulation and understanding of components must be revisited..! (re-read)
Experiment on Parity Error
- ACT does have an impact on lowering sequence error rate, with less penalty, it ponders more
Experiment on Wikipedia Character Prediction
- ACT with time penalty has negligible effect, but with lower time penalty, model ponders more and especially on spaces and eos.

Personal Thoughts

Beautiful vizualization
I'm not sure how ACT is utilizing attention well yet..
No direct link to usage in NMT
But Alex Graves is the must-read and must-understand researcher

Link : https://arxiv.org/pdf/1603.08983.pdf
Authors : Alex Graves et al 2017

Unsupervised Pre-training for Sequence to Sequence Learning

Abstract

Propose to pre-train the encoder and decoder of seq2seq model with the trained weights of two language models
An additional language modeling loss is used to regularize the model during fine-tuning
SOTA in WMT 2014 English->German

Details

Basic Methodology
- Train LM_src, LM_tgt using monolingual data
- Use LM-pre-trained model's Embedding Layer, 1st LSTM Layer and Softmax Layer in Decoder
Improvement
- Add monolingual loss during training to preserve the feature extractor of LMs (one can freeze pre-trained weights for few epochs and train the whole weight later too)
- Residual connection to mitigate initial noise corrupting the pre-trained weights
- Multi-layer attention, extracting both low and high level contextual information

Result
- Outperforms Back-translation

Ablation Study
- Pre-training decoder is better, because decoder does more difficult job of keeping semantics and generating sentence with correct syntax
- Gains of pre-training greatly overlap
- LM objective serves as a big regularizer, it assures fluency for sure

Personal Thoughts

Why only benchmark back-translation?
How can I train Language Model with internal monolingual data?
Authors did good and careful job of not corrupting pre-trained weights, knowing that LM weights provide good fluency features

Link : https://arxiv.org/pdf/1611.02683.pdf
Authors : Ramachandran et al. 2017

Copied Monolingual Data Improves Low-Resource Neural Machine Translation

Abstract

Using monolingual corpora in low-resource NMT by adding copied-corpus in training data
BLEU benefits around ~1.2 in Turkish > English and Romanian > English translation task

Details

Related Works
- Back Translation by Sennrich et al. 2016: train target->source NMT to perform translation of target->source on target monolingual corpora, and resulting parallel corpora is combined with original parallel corpora.
- Multi-task systems by Johnson et al. 2016: combining multiple translation directions (French->English, German->English etc)
- This paper proposes using simple copy of target monolingual corpora to obtain parallel target->target corpora, and use it as training data
Amount of Resource per language pair
Performance improves with Low-Resource pairs
On the contrary of assumption,
- fluency does not improve with copied corpus added, shown by perplexity in language model
- the translation of pronouns, named entities and rare words improve, shown by pass-through accuracy
Amount of Monolingual Data
- Even with 3:1 ratio of monolingual to parallel corpora, BLEU increases.
- Monolingual copied corpora does not hurt learning even its ratio is relatively high

Personal Thoughts

Lesson : Low-Resource NMT techniques work when parallel corpus below 1M
Simple and Elegant method with incremental result, but not confident that BLEU +1.2 is a significant improvement in quality

Link : http://statmt.org/wmt17/pdf/WMT15.pdf
Authors : Currey et al 2017

CHRF: character n-gram F-score for automatic MT evaluation

Abstract

Propose the use of character n-gram F-score for automatic evaluation of machine translation output (language independent, tokenization independent)
CHRF3 score – for translation from English, showed the highest segment-level correlations

Details

ChrP (Precision) & ChrR (Recall)

65% preferred than BLEU

Personal Thoughts

Higher correlation to Human Evaluation than BLEU, observed in Results of the WMT17 Metrics Shared Task
Next Generation : chrF++

Link : http://www.statmt.org/wmt15/pdf/WMT49.pdf
Authors : Popovic et al. 2015

Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model

Abstract

Propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold
Although there are lots of effort in network architecture, learning algorithms and novel applications, decoding is not well studied

Details

Recurrent models are de facto for linguistic tasks (language models, machine translation, dialogue, question answering etc)
Noisy Parallel Approximate Decoding (NPAD)
- meta-algorithm that runs in parallel many chains of the noisy version of an inner decoding algorithm(greedy or beam search)
- fully parallelizable, speed is almost equivalent to single greedy search
- a neighborhood in the hidden state space corresponds to a set of semantically similar configurations in the input space, regardless of whether those configurations are close to each other in the input space

- adding Gaussian noise in calculation of logit, start with high level of noise and anneal it as the decoding progresses - among M hypotheses, select the argmax of log probability

Why not sampling?
- more efficient to sample in the hidden state space than in the output state space
- hidden space 'fills in' the semantically similar neighbors, whereas state space is more sparse
Different from Diverse Decoding
- Diverse decoding is applicable to beam search only, and is conditional on previous beams
- NPAD is independent and parallelizable
Stochastic Sampling vs NPAD
- stochastically sampling from final softmax distribution does improve upon greedy beam search, but NPAD with correct parameter outperforms both NLL and BLEU

Number of Parallel Chains
- obviously, larger parallel chains improve performance

NPAD with beam search
- further improvement, although marginal, when combined with beam search

NPAD vs Diverse Decoding
- NPAD with beam search is marginally better than Diverse Decoding, almost identical to me

Personal Thoughts

Fast, applicable decoding mechanism
Interesting idea to inject noise (annealed by time-step) during training as well

Link : https://arxiv.org/pdf/1605.03835.pdf
Authors : Cho et al. 2016

Block Sparse RNN

Abstract

Even though pruning methods reduce number of parameters by 90%, the speed-up is less than expected on different hardware platforms due to indexing overhead, irregular memory access and inability to utilize array data-path
Propose pruning weights in block format, train with group lasso regularization to encourage sparsity in the model
10x smaller parameters with ~10% loss of accuracy

Details

Block Prune
- prune blocks of a matrix instead of individual weights, if maximum magnitude of a block is less than threshold

Pruning during Training
- this method actively prunes the parameters during training
- Hyperparameters for pruning are

Group Lasso Regularization
- L2 loss of group of weights

Experiments
- Speech Recognition system with CNN, RNN and FC layers
- < 5% loss of accuracy obtained when model parameter size is reduced by 1/3 ~ 1/4
- BP (block pruning), GLP (group lasso regularization with block pruning)

Speed-up
- Block Pruning has higher speed-up when batch is big

Pruning Schedule
- BP and GLP actively prunes than ordinary Weight Pruning

Performance over Prune ratio
- sudden decrease in performance after 90% threshold
- lower layers are pruned more than higher layers

Personal Thoughts

wanted to see pruning in NMT
batch=1 has speed-up of ~3, wonder how they implemented it.
- if op is sparse, then do I have to code new inference nmt.py?
Parameter settings in experiments were quite odd..
- not sure what the real message is, number of hidden sizes and resulting number of parameters are just all over the place
I think they will be rejected..

Link : open-reveiw @ ICLR2018
Authors :

Squeeze-and-Excitation Networks

Abstract

Propose a novel architectural unit, “Squeeze-and-Excitation”(SE) block, that adaptively re-calibrates channel-wise features in CNN.
Ensemble of SENets won 1st place in ILSVRC 2017 classification task with top-5 error rate 2.251% (~25% improvement from last year)

Details

Squeeze unit
- global average pooling over channel (describes the whole channel in single number)
Excitation unit
- 2 x FC layers with dimension reduction and dimension rescale (ratio of 16) with ReLU and sigmoid is multiplied to scale the original input

Can easily be inserted into existing SOTA CNN architectures. (ResNet, ResNeXt, Inception etc)
only fraction of parameter increase (<10% of whole param)
Figure 8. shows that lower layers have almost identical distributions across labels, upper layers have meaningfully different distributions across labels, and last layer seems to be saturated.

Personal Thoughts

Very clever idea of tweaking channel-wise inter-dependency
Applicable to re-calibrate channels of encoder states before they go into attention
Great work by self-driving car start-up, Momenta.ai in Beijing

Link : https://arxiv.org/pdf/1709.01507.pdf
Authors : Jie Hu (Momenta) et al. 2017

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Abstract

Three stage compression pipeline reduces parameter size by 35x ~ 49x
Compressed network has 3x ~ 4x speed enhancement and 3x ~ 7x energy efficiency.

Details

[Network Pruning] by Han et al. 2015
- Learn connectivity via normal network training.
- Prune the small-weight connections: all connections with weights below a threshold are removed from the network.
- Retrain the network to learn the final weights for the remaining sparse connections.
- Pruning reduced the number of parameters by 9× and 13× for AlexNet and VGG-16
  model.
[Trained Quantization and Weight Sharing]
- Use k-means clustering to identify the shared weights for each layer of a trained network
- Weights are quantized to k bins and updated via back propagation in uniform manner
- Linear centroid initialization is preferred over random/density due to better spreadability of large weights
[Huffman Coding]
- optimal prefix code commonly used for lossless data compression by Van Leeuwen et al. 1976
Experiments
- Baseline experiment with MNIST using LeNet-300 (1070KB -> 27KB ~ 40x)
- AlexNet, VGG-16 on ImageNet (35x ~ 49x compression)
- Pruning and Quantization works well together
- Speed Enhancement
- SOTA among loss-less compression

Personal Thoughts

Very well-written paper.
Applicable to NMT
Tested in CNN, will it work in RNN?
Systran also uses pruning only, since quantization with network sharing is not supported in cuSPARSE etc

Link : https://arxiv.org/pdf/1510.00149.pdf
Authors : Han et al. 2016

Beam Search Strategies for Neural Machine Translation

Abstract

Propose to speed up the decoder by applying a more flexible beam search strategy whose candidate size may vary at each time step depending on the candidate score
10% speed up in beam_size=5 without loss of accuracy

Details

Beam Search is disadvantageous
- Less adaptive, it expands candidates whose scores are much worse than the current best
- It discards hypotheses if they are not within the best scoring candidates, even if the scores are close
Search Strategies
- Relative Threshold Pruning
  - relative threshold compared to the best candidte
- Absolute Threshold Pruning
  - Discard candidates less than absolute margin than best candidate
- Relative Local Threshold Pruning
  - Consider the score of last generated word only in pruning
- Max Candidates per Node
  - Fix the number of candidates with same history in each time step
Fan Out per Sentence
- fan out : number of candidates we expand
- Original BeamSearch has linear fan out
- Proposed BeamSearch adaptively reduces the number of fan outs
Results
- With beam_size=5, 10~13% speed improvement with using all the proposed methods

Personal Thoughts

Better decoding speed without hurting performance, this is what I've wanted!

Link : https://arxiv.org/pdf/1702.01806.pdf
Authors : Freitag et al. 2017

A Teacher-Student Framework for Zero-Resource Neural Machine Translation

Abstract

Train source-to-target NMT (student) without parallel corpora available, guided by the existing pivot-to-target NMT (teacher) on a source-pivot parallel corpus
X : source, Y : target, Z : pivot

Details

Related Works
- Triangulated pivot-based method (X->Z, Z->Y), exposed to error propagation issue
- multilingual (shared encoder/decoder structure)
Teacher-Student Approach
- based on translation equivalence assumption (similar to knowledge distillation - sequence level)
Sentence-Level Teaching
- Assumption : If a source sentence x is a translation of a pivot sentence z, then the probability of generating a target sentence y from x should be close to that from its counterpart z
- Minimizing KL divergence of two distribution leads to good translation from X to Y
- Remove teacher model parameter, they are fixed
- Training objective is to minimize the below
Word-Level Teaching
- Similarly, training objective is to minimize the below
Result
- Word-sampling outperforms sentence-beam

Personal Thoughts

Does not have to go with absolute zero resource, rather try learning with monolingual only
Very similar to knowledge distillation, using the distribution learnt in the translation model to learn another student model

Link : https://arxiv.org/pdf/1705.00753.pdf
Authors : Chen et al. 2017

Ensemble Distillation for Neural Machine Translation

Abstract

Empirical experiments on Knowledge Distillation using Ensemble, OracleBLEU, same-sized student model etc

Details

Teacher Networks
- Ensemble Teacher Model : ensemble of 6 models with different random initialization as teacher
- Oracle BLEU Teacher Model : among beams, choose the one with highest bleu
Data Filtering
- Amongst translated training data, if TER is higher than 0.8, remove from both original training data and translated data
Experiment
- WMT2016 German -> English task with 3.9M parallel sentences
- 40k BPE
- 1,000 GRU, 620 Word Embedding, batch 64, SGD with annealing learning rate
Results
- Single Teacher : Training with all data in same-sized student model outperforms
- Ensemble Teacher : data filtering is better
- Oracle BLEU Teacher : not effective
- Ensemble Teacher with Small Student : always have loss of accuracy

Personal Thoughts

Application to Papago
- Try to get the best-performing teacher model via Ensemble or high beam-size, NPAD or whatever
- Training same-sized student model for better performance is impressive
- Training small student does have loss of accuracy, but ours is too large. We can do better

Link : https://arxiv.org/pdf/1702.01802.pdf
Authors : Freitag et al. 2017

Neural GPUs Learn Algorithms

Abstract

Neural Turing Machines (NTMs) learn an algorithm from examples, that are fully differentiable computers that use backpropagation to learn their own programming.
Propose Neural GPU, a type of convolutional gated recurrent unit that is highly parallel and efficient to train.
Neural GPU can be trained on short instances of an algorithmic task(addition and multiplication) and successfully generalize to long instances.

Details

Use CGRU (Convolutional Gated Recurrent Unit) that is similar to GRU with convolutional kernel added as main block.
Good performance in addtion and multiplication, good generalization to longer sequences.
Great effort in optimization process
- Grid Search : 3^6 = 729 parameters
- Curriculum Learning : train n-digit number only after making 90% performance in (n-1)-digit number
- Gradient Noise : add Gaussian noise to gradient, multiplied by fraction of non-fully-correct output
- Gate cutoff : cutoff for sigmoid function
- Parameter Sharing Relaxation : let hidden units of RNN be different and gradually converge to single parameter (Relaxation was critical in fitting the training data)

Personal Thoughts

Great engineering and effort in optimization process
Well-explained, thorought paper to read

Link : https://arxiv.org/pdf/1511.08228.pdf
Authors : Lukasz Kaiser, Ilya Sutskever (Google Brain) et al. 2016

Understanding Deep Learning Requires Rethinking Generalization

Abstract

We must rethink about generalization
Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training, but these traditional approaches fail to explain why large neural networks generalize well in practice.
Experiment with SOTA CNN models on image classification with SGD, fitting a random training data.

Details

Deep neural networks easily fit random labels.
- Effective capacity of the model is big enough to fully memorize the randomized training data.
- Inception V3 (with dropout and weight decay) fits perfectly to random training data - not truly generalizing.
[Summary] Both explicit and implicit regularizer, when properly tuned, could help to improve the generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers removed.
- l2 norm is not absolute : weights with higher l2 norm have better generalization than weights with lower l2 norm
- SGD acts as an implicit regularizer
- On small data sets that even Gaussian kernel methods can generalize well with no regularization.
- Early stopping has potential of regularization - the effect is case-by-case.

Personal Thoughts

Understanding neural network is difficult, because all the theoretical assumptions do not hold in non-convex, data-dependent, .. environment.
Good models are models that generalize well, where is this good generalization really coming from?

Link : https://arxiv.org/pdf/1611.03530.pdf
Authors : Chiyuan Zhang (Google Brain) et al. 2017

Structured Attention Networks

Abstract

Adding structure to Attention module - a linear chain conditional random field and a graph-based parsing model
Experiments on tree transduction, neural machine translation, question answering and natural language inference shows better performance and improved behavior

Details

Attention is a function of key, value and query where key holds the whole context, query holds the context to be answered and value holds relevant contents.
In author's word, attention mechanism is the expectation of an annotation function with respect to a latent variable which is parameterized to be function of source and query.
Segmentation Attention
- using linear-chain CRF with pairwise edges, it adds pairwise structure
Syntactic Attention
- using graph-based parsing model, it adds tree-like structure
End-to-End training
- forward pass is simple
- backprop is not fully optimized with off-the-shelf tools
- training takes 5x slower than simple attention mechanism, inference is almost similar
Neural Machine Translation
- EnJa data from WAT, 500k sentences, less50
- character-level and word-level with vocab cut-off of 10
- result is not significant in word-level, slight increase in character-level
- Visualization of attention : shows richer, denser attention when structure is added

Personal Thoughts

Agree on enriching the attention mechanism is a good are of research
not sure EnJa from WAT was good benchmark corpus, no significant improvement
too much information is compressed into attention mechanism
- even the single token holds distributed context/content from surrounding.

Link : https://arxiv.org/pdf/1702.00887.pdf
Authors : Kim et al 2017

Learning to Translate in Real-time with Neural Machine Translation

Abstract

Simultaneous NMT with binary RL decision maker
Evaluation metric for simultaneous NMT is a combination of quality (BLEU) and delay

Details

Uni-directional RNN as NMT Enc-Dec
RL controller deciding whether to READ or WRITE in policy gradient

Evaluation
- Quality : BLEU
- Delay : Average Proportion (by Cho et al. 2016), Consecutive Weight Length and Target Delay

Personal Thoughts

Extensive experimentation seems to be done
simultaneous MT problem is very different from ordinary MT

Link : http://www.aclweb.org/anthology/E17-1099
Authors : Gu et al. 2017

Neural Machine Translation with Reconstruction

Abstract

NMT systems often lacks adequacy
Propose a novel encoder-decoder-reconstructor framework for NMT, utilizing target-source information as additional feedback

Details

New training objective, adding reconstruction error with lambda = 1

- Raises issue of sub-optimality in decoding likelihood objective, and empirically shows performance improves even with very large beam size (1000), but not quite sure this is important. In practice, beam size < 10 is enough.

Personal Thoughts

Surprised that BLEU improvement is small (Is BLEU +1.5 really significant?)
Use of target-source information in parallel corpus was impressive
NMT is difficult to operationalize problem

Link : https://arxiv.org/pdf/1611.01874.pdf
Authors : Tu et al. 2016

Understanding Black-box Predictions via Influence Functions

Abstract

Use influence function to trace a model's prediction back to its training data.
Approximation of influence function that requires gradients and Hessian vectors provides valuable information
Useful in debugging models and detecting dataset errors

Details

Using influence function, one can ask questions such as "What is the model parameter like when certain training data was missing/altered?" without re-training the whole model
Useful in detecting adversarial examples
Useful in fixing mislabeled examples by providing good candidate lists, but limited boost compared to simple listing via highest training loss

Personal Thoughts

Understanding neural network is difficult, because all the theoretical assumptions do not hold in non-convex, data-dependent, .. environment.
Good approximation methods are always powerful and applicable

Link : https://arxiv.org/pdf/1703.04730.pdf
Authors : Pang Wei Koh(Stanford), Percy Liang(Stanford)

Confidence through Attention

Abstract

Use attention distribution to evaluate the translation quality
Use attention-filtered synthetic data added to existing parallel corpus to improve NMT translation quality in BLEU

Details

Attention-based Metrics

Coverage Deviation Penalty
- aims to penalize the sum of attentions per input token for going too far from 1

Absentmindedness Penalty
- dispersion of attention is measured via the entropy of the predicted attention distribution. Again, we want the penalty value to be 1.0 for the lowest entropy and head towards 0.0 for higher entropies

Training NMT with additional data
- back-translation is good
- but, attention-filtered synthetic data is also better
- it helps especially in morphologically rich -> weak language direction

Personal Thoughts

Making data more rich and smoother by back-translation, copied-corpus, seq-level knowledge distillation and attention-filtered synthetic corpus is very strong

Link : https://arxiv.org/pdf/1710.03743.pdf
Authors : Rikters et al 2017

Findings of the 2017 Conference on Machine Translation (WMT17)

Abstract

Comprehensive explanation of WMT17 Tasks and their results
- Machine Translation tasks : news, biomedical and multimodal
- Evaluation tasks : metric and run-time estimation of MT quality
- Automatic Post-Editing task
- Neural MT training task
- Bandit Learning task

Details

Main in MT task (news)
- eval set is 1,500 per language direction. Total 3,000 sentences.
Evaluation of Direct Assessment (DA) via crowd-sourcing was impressive
- requires monolingual background
- careful control variables (replica of ref, real MT output, altered MT output) to filter out unreliable crowd (gamer)
- pearson correlation of 0.97+ with professional researcher's result

Personal Thoughts

WMT is a good conference.
Decisions being made such as continual usage of HTER as metric where HTER has discrepancy between BLEU and human evaluation for future reference, careful experimentation and validation of DA metric are mature

Link : http://statmt.org/wmt17/pdf/WMT17.pdf
Authors : Bojar et al. 2017

The Helsinki Neural Machine Translation System

Abstract

Helsinki NMT ranked 1st inWMT2017 News Translation task in English-Finnish

Details

Arsenals
- Layer Normalization : preliminary experiments showed no improvement
- Variational Dropout : dropout in recurrent states
- Context Gates : achieved better cross-entropy, but no improvement in BLEU or chrF3
- Coverage Decoder : preliminary experiments showed no improvement
- Ensemble : Proper ensemble is best, but Parameter Averaging also helps
Experiments
- Choice of Segmentation Strategy
  - BPE in decoder performs well, char-level decoder performs high in chrF3

- Ensemble - Proper ensemble is best, but parameter averaging helps

- In dev set, they found lots of contractions (wouldn't etc) which were not present in training set, so they de-tokenized them

Personal Thoughts

Lots of ideas, tested in preliminary baseline and applied effective ones in large scale data
Language specific tunings such as dev set tuning and exhaustive search in enc/dec segmentation strategy, which lead to first place in English-Finnish task

Link : https://arxiv.org/pdf/1708.05942.pdf
Authors : Ostling et al. 2017

What do Neural Machine Translation Models Learn about Morphology?

Abstract

Analyze the representations learned by neural MT models through part-of-speech and morphological tagging tasks
Parameters include : word-based vs. character-based representations, depth of the encoding layer, the identity of the target language, and encoder vs. decoder representations

Details

Train NMT model and use hidden layer units to perform other linguistic task in order to compare the quality of representation in each parameter variables.
Findings
- char-level are better at learning morphology than word-level
- Lower layers of the encoder are better at capturing word structure, while deeper networks improve translation quality, suggesting that higher layers focus more on word meaning.
- Morphologically rich -> poor translation is difficult, the opposite is simpler task
- decoder learns very little about the word structure. When attention used, it learns even less.

Personal Thoughts

Good methodology, trying to understand the amount of information in the representation via looking at performance in other task was unique
Good experiments, visualizations and explanations to the phenomenon
Not sure, NMT has been more transparent.
Role of the decoder is still mystery
Role of depth is also mystery

Link : https://arxiv.org/pdf/1704.03471.pdf
Authors : Belinkov et al 2017

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Abstract

Neural TTS model, advanced from Deep Voice 1
Can train Multi-speaker Embedding in single TTS model

Details

Complicated inference system in Neural TTS
Architecture for Multi-Speaker
Tacotron with Speaker conditioning
Result

Personal Thoughts

TTS is another mysterious, but very interesting area
I must revisit with good passion to implement them!

Link : https://arxiv.org/pdf/1705.08947.pdf
Authors : Arik et al. 2017

Dynamic Routing Between Capsules

Abstract

A capsule network proposed by Geoffrey Hinton, using layer-wise parallel attention
Insight from attention in human vision where irrelevant details are ignored via sequence of fixation points
Activities of the neuron in an active capsule represent the various properties of a particular entity that is present in the image (position, thickness, size, orientation, deformation etc)

Details

Routing Algorithm
- Existing CNN simply max-pools the single scalar from matrix of numbers to extract the most impressive traits
- Capsules pools information from previous layer's capsules via dynamic routing algorithm
  - Routing Softmax determines the initial from L to L+1 connectivity -> input in L+1 is calculated via weighted sum -> input in L+1 is squashed to 0~1 range -> routing logit from L to L+1 is updated by L's prediction and squash (~attention mechanism)
Architecture
- Simple 3-layer CapsNet with routing connection between Primary caps and DigitCaps only
Result on MNIST
- better than CNN
What CapsNet learns
- each dimension in DigitCaps do learn some properties
MultiMNIST
- learning overlapping digits
- equivalent to SOTA ~ 5% error rate on tes set

Personal Thoughts

결국, Attention is how we improve neural network
details on training is not shared on the paper
how can I apply this to NMT?

Link : https://arxiv.org/pdf/1710.09829.pdf
Authors : Sabour et al. 2017

Non-Autoregressive Neural Machine Translation

Abstract

Non-autoregressive NMT allows an order of magnitude lower latency during inference.
Through knowledge distillation, the use of input token fertilities as a latent variables, and policy gradient fine-tuning, they achieve 2.0 BLEU points lower than autoregressive Transformer network used as a teacher

Details

Towards non-autoregressive decoding
- naive method is to predict each output independently, but this does not yield good result due to multimodality problem, a conditional independence between various target sentences.
- for example, 'thank you' in English can be translated into 'danke schon' or 'vielen dank', but 'danke dank' is not acceptable.

Non-Autoregressive Transformer (NAT)

Encoder stack
- encoder stays unchanged from original Transformer
Decoder stack
- Decoder inputs
  - copy source inputs using fertilities
  - Fertilities mean the number of times each input is copied into decoder inputs, it controls the "speed" at which decoder translates and determines the length of target sentence
- Non-causal self-attention
  - since no masking is needed for later tokens, simply use self-attention with masking out each query position only from attending to itself
  - Positional attention
    - include additional positional attention module in decoder layer, which adds stronger positional signal and allows decoder to perform local reordering
- Fertility
  - a latent variable that models the nondeterminism in the translation process
  - sample z from a prior distribution and then condition on z to non-autoregressively generate a translation
  - with max = 50
- Conditional Probability of a Target Translation, Y is
Translation Predictor and Decoding Process
- searching for all combinations of fertility is intractable, so
  - choose argmax fertility for each input word
  - choose expected value for fertility
  - noisy parallel decoding, samples random fertility sequence and decode in parallel, then choose argmax
Training (read again)
- Sequence-level Knowledge Distillation
  - use teacher-generated target corpus for training, original corpus results in too noisy and nondeterministic
- Fine-tuning stage with reverse KL divergence with teacher output distribution in a form of word-level knowledge distillation
  - KD is favorable towards highly peaked student output distributions than a standard cross-entropy error would be
- Joint training (read again)
  - sum of original distillation loss, expectation over fertility distribution, normalized with a baseline and other based on external fertility inference model via policy gradient and backprop
Experiments
- IWSLT16 En-De as development
- WMT14 En-De and WMT16 En-Ro as final result verification
- use shared BPE, for IWSLT use separate vocab and embedding
- transfer student's encoder weights from encoder weights from its teacher
- fertility prediction is supervised with fast_align by IBM Model 2 during training
Results
NAT 2~5 BLEU less than AT
Speed up for > x15 over beam search in teacher model
Good experimental scheme
Fine-tune does not converge with RL or BP alone, must use all three fine-tuning terms to get +1.5 BLEU
Overall Structure
src_len vs Latency
learning curve for NAT (bleu on dev set)

Personal Thoughts

Contributions
- Non-autoregressiveness
NPD is strong, let's implement
They tried all ideas, saw their poor performance and tried another
analysis of deterministicness, approximating distribution, policy gradient, roles of each module is still unkown to me
for naver_dic, try to reduce hyperparameters (model size, hidden size, layer, head, warmup step)
use their viz scheme (src_len vs latency, learning curve)

Link :
Authors :

Learning to Remember Rare Events

Abstract

Present a large-scale life-long memory module for use in deep learning
Uses nearest-neighbor algorithm (LSH for approximate version) to extract from memory
Memory module can be easily added to any part of a supervised neural network

Details

Memory has layer embedding as key, label as value and age for timestep

Loss term used in Memory module

If no positive-neighbor is found, update memory by adding a new key-value pair
If positive-neighbor is found, update the memory value by its average

Evaluated life-long memory module in image classification, machine translation, synthetic data with Extended GPU, all having positive impacts

Personal Thoughts

Lukasz - God - Kaiser paper
it is definitely must-implement paper

Link : https://arxiv.org/pdf/1703.03129.pdf
Authors : Kaiser et al. 2017

Statistical Machine Translation - Ch.13 Neural Machine Translation

Abstract

Draft of textbook chapter on neural machine translation.
A comprehensive treatment of the topic.

Details

Self-attention in Transformer model described in detail in page 87.

Personal Thoughts

Good comprehensive guide
self-attention model is really a thing

Link : https://arxiv.org/pdf/1709.07809.pdf
Authors : Koehn et al. 2017

Sequence-Level Knowledge Distillation

Abstract

Knowledge distillation approach applied to NMT
Experiment three types of knowledge distillation : word-level, sequence-level, sequence-level interpolation
Best student model runs 10 times faster than its sota teacher with little loss in performance
Network pruning can further reduce parameters with little loss in performance

Details

Overview of Three types of Knowledge Distillation

Word-level KD
- student-model learned on the word-level cross-entropy loss of teacher model

Sequence-level KD
- student-model learned on the new training set that is a beam-search result of teacher model
- beam-search result of teacher model is a tractable approximation of sequence-level knowledge

Sequence-level Interpolation KD
- uses original training data and beam-search result of teacher-model simultaneously to balance the objective function

Experiment on En-De
- 4M training data, 50k word-level
- 4 x 1,000 LSTM as teacher-model
- 2 x 300 LSTM and 2 x 500 LSTM as student-model

Result
- Seq-KD on 2 x 500 LSTM has better BLEU on greedy-search, comparable BLEU on beam-search and note that greedy-search result is 16.9% compared to 1.3% in baseline (speed enhancement can be expected)
- Word-KD, Seq-KD, Seq-Inter are complementary to each other
Speed
- 2 x 500 LSTM with greedy search (1051.3) is 10 times faster than 4 x 1,000 LSTM with beam-search

Pruning
- Network pruning can be complementary, and further reduces param count with little loss of performance

Personal Thoughts

Applicable to Papago NMT
- Faster decoding time with less storage (possibly on mobile phone)

Link : https://arxiv.org/pdf/1606.07947.pdf
Authors : Kim et al. 2016

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Abstract

BeamSearch is an approximate inference model for intractable inference function in Sequence Models, but it often produces nearly identical sequences on top beams
Propose Diverse Beam Search (DBS) that decodes a diverse output by optimizing for a diversity-augmented objective
First paper that I've read on OpenReview, which got rejected by a small margin. I should be reading open-reviews carefully.

Details

Beam Search
- Maximum a Posteriori (MAP) inference for RNNs is the task of finding the most likely output sequence given the input. Since the number of possible sequences grows as |V|^T, exact inference is NP-hard so approximate inference algorithms like Beam Search (BS) are commonly employed. BS is a heuristic graph-search algorithm that maintains the B top-scoring partial sequences expanded in a greedy left-to-right fashion.
- At each step, select top-k beams that have maximum cumulative log probability scaled by length penalty
Diversity in Beam Search
- inherently, lacks the diversity in top-k beams

Diverse Beam Search
- optimize an objective that consists of two terms – the sequence likelihood under the model and a dissimilarity term that encourages beams across groups to differ. This diversity-augmented model score is optimized in a doubly greedy manner – greedily optimizing along both time and groups.
- Partition B beams into G Groups where each group is given dissimilarity term related to previous groups, hence similar token as previous groups are discouraged.
- At each step, select top-(k/G) beams in each group that have maximum sum of cumulative log probability and dissimilarity term scaled by length penalty

Result
- Evaluates on Oracle BLEU which is best bleu score among top-k beams
- This evaluation is not significant when NMT only requires top-1 beam to be the output during inference. This is where this paper had its weakness in convincing the reviewer.

Diversity score is higher with larger groups (obviously), and Hamming diversity works best for dissimilarity function

Personal Thoughts

Well-written paper, but sad to see that this much effort did not pass ICLR 2017
Cannot be used in Papago because what we need is a decoder that generates better translation on top-1, not a decoder that can give us diverse candidates
Perhaps, useful in generating multiple target sentences from teacher-model during knowledge distillation. NMT's biggest problem is that the single answers are given to the source-target pair which can have multiple answers, which leads to inadequate penalties serving as a noise.
- Let's implement it and generate one-source-multiple-target NMT corpus

Link : https://arxiv.org/pdf/1610.02424.pdf
Github Link : https://github.com/ashwinkalyan/dbs
Authors : Vijayakumar et al. 2016

SYSTRAN Purely Neural MT Engines for WMT2017

Abstract

SYSTRAN’s submission to the WMT 2017 shared news translation task for English-German
Back-translation and Hyper-specialization
uses OpenNMT

Details

WMT 2017 News Translation Task
- Data 4.6M Parallel corpus

Training
- Nvidia GTX 1080 ~ 64 per minibatch
- SGD (0.1) with annealing rate (0.7)
Back Translation
- translating target language back into source language and using it as parallel corpus
- Synthetically generated back-translated data 4.5M + original 4.5M after 13 epochs of original 4.5M training
- it improves performance!

Data Selection vis LM model
- Less data are used to fine-tune model
- data is chosen by two 3-gram LM model trained one from news corpus and one from random sampling. When the difference of cross-entropy is big, we treat it as news related sentence and include in fine-tune corpus
Hyper-specialization
- 25K news related set tuned with learning rate 0.7
- improves BLEU by +0.3~0.5

Personal Thoughts

Good to see Systran openly participating and contributing to WMT2017
Amount of data is really strong, when generated via back-translation, distillation, monolingual!
Hyper-specialization is competition-fit strategy for squeezing the performance ~ likely overfitting

Link : https://arxiv.org/pdf/1709.03814.pdf
Authors : Deng et al. 2017

What does Attention in Neural Machine Translation Pay Attention to?

Abstract

Analyze how attention is similar or different from the traditional alignment (word)
Result : attention is different from alignment in some cases and is capturing useful information other than alignments (hence it's different)

Details

Using RWTH De-En data, measure attention-alignment accuracy
Average attention loss/Average word prediction loss on POS tags of target side
- Noun and Verbs are most common POS tags
- Noun has lowest average attention loss, meaning attention is similar to alignment
- Verb has 2x attention loss, meaning attention is quite different from alignment in verbs
Correlation between WPE and Attention loss for input-feeding model
- Low correlation for verbs confirm that attention to other parts of the source sentence rather than the aligned word is necessary for translating verbs

Personal Thoughts

Of course, attention is not an alignment
Attention is a mix of context and meaning, it holds all the necessary components for the task in secret way
how is attention different in character-level/sub-word NMT?

Link : https://arxiv.org/pdf/1710.03348.pdf
Authors : Ghader et al 2017

Exploring Sparsity in Recurrent Neural Network

Abstract

Propose a pruning method during the training of the network
Time to train the model remains constant, and network size is reduced by 8x
Pruning a larger dense network is better than dense network with same parameter size

Details

Pruning Methodology
- maintains a set of masks, a monotonically increasing threshold and a set of hyper parameters that are used to determine the threshold
- this binary mask is multiplied to the weight every time. All weights get trained and updated (mask dynamically changes)
- Prune fully connected layers and recurrent states. Do not prune bias or normalization parameters.
- Setting hyper parameters
Pruning Algorithm
- Observe the types of layers and number of parameters for your network
Results
- Sparse model starting with larger model performs on par with dense model with 8x reduction in param number
- Speed-up is not as expected, due to the inefficient support for cuSPARSE libraries yet to come
Pruning percent
- Lower layers are pruned more aggressively
- (b) shows the number of pruned params increasing exponentially, which performs better than linear one

Personal Thoughts

Good pruning method, easy to implement
Hope the speed-up is also as much as storage size improvement

Link : https://arxiv.org/pdf/1704.05119.pdf
Authors : Narang et al. 2017

kweonwooj / papers Goto Github PK

papers's People

Contributors

Stargazers

Watchers

Forkers

papers's Issues

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract

Details

Personal Thoughts

Abstract