Yes, in the case of soft attention: results are somewhat mixed across tasks.
Active memory operates on all of the memory in parallel in a uniform way, bringing improvements in algorithmic tasks, image processing, and generative modelling.
Does active memory perform well in machine translation? [YES]
Details
[Attention]
Only a small part of the memory changes at every step, or the memory remains constant.
An important limitation of the attention mechanism is that it can only focus on a single element of the memory, due to the nature of its softmax.
[Active Memory]
Any model where every part of the memory undergoes active change at every step.
[NMT with Neural GPU]
parallel encoding and decoding
BLEU < 5
conditional dependence between outputs is not considered
[NMT with Markovian Neural GPU]
parallel encoding and 1-step conditioned decoding
BLEU < 5
possibly, Markovian dependence of the outputs is too weak for this problem - a full recurrent dependence of the state is needed for good performance
[NMT with Extended Neural GPU]
parallel encoding and sequential decoding
BLEU = 29.6 (WMT 14 En-Fr)
the active memory decoder tensor (d) holds the recurrent decoding state, and the output tape tensor (p) holds past decoded logits; both are updated through CGRU^d.
[CGRU]
convolutional operation followed by recurrent operation
stacking CGRUs expands the receptive field of the convolution
the output tape tensor acts as an external memory of decoded logits
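A minimal CGRU sketch (my own reconstruction in PyTorch; the class name, kernel size, and exact gating details are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class CGRU(nn.Module):
    """Convolutional GRU: GRU gating where the matrix multiplies are replaced
    by convolutions, so the whole memory tape is updated in parallel."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_u = nn.Conv1d(channels, channels, kernel_size, padding=pad)  # update gate
        self.conv_r = nn.Conv1d(channels, channels, kernel_size, padding=pad)  # reset gate
        self.conv_c = nn.Conv1d(channels, channels, kernel_size, padding=pad)  # candidate

    def forward(self, s):
        # s: (batch, channels, width) -- the memory tape
        u = torch.sigmoid(self.conv_u(s))
        r = torch.sigmoid(self.conv_r(s))
        c = torch.tanh(self.conv_c(r * s))
        return u * s + (1 - u) * c  # gated update of every cell at once
```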
Personal Thoughts
Same architecture, but encoder and decoder hidden states may be doing different things
encoder : embed semantic locally
decoder : track how much it has decoded, use tape tensor to hold information of what it has decoded
Will it work for languages with different sentence order?
What part of the translation problem can we treat as convolutional?
Is Transformer a combination of attention and active memory?
Utilize dual tasks that have intrinsic connections with each other due to the probabilistic correlation (En-Fr vs Fr-En translation, Speech Recognition vs Text to Speech, Image Classification vs Image Generation)
Propose dual supervised learning method that trains dual tasks simultaneously.
Improves performance of both tasks
Details
Conditional distributions of the primal and dual tasks should satisfy the following equality:
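From the paper, the constraint is the identity on joint probabilities (notation restated from memory):

$$P(x)\,P(y \mid x; \theta_{xy}) = P(y)\,P(x \mid y; \theta_{yx}) = P(x, y)$$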
Add a probabilistic duality term to the loss function, as specified below:
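As I recall the formulation, with marginals $\hat{P}(x), \hat{P}(y)$ estimated by language models, the duality term is the squared gap between the two factorizations in log space:

$$\ell_{duality} = \big(\log \hat{P}(x) + \log P(y \mid x; \theta_{xy}) - \log \hat{P}(y) - \log P(x \mid y; \theta_{yx})\big)^2$$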
lambda_xy are hyperparameters; best performance is obtained with lambda near 0.01, which suggests the effect of the probabilistic duality term is quite small.
Personal Thoughts
Utilizing the duality of the tasks is clever and practical; in theory, it amounts to more data.
The improvement, however, seems limited.
Link : https://arxiv.org/pdf/1707.00415.pdf
Authors : Yingce Xia (School of Information Science and Technology, University
of Science and Technology of China, Hefei, Anhui, China) et al. 2017
Introduce the Key-Value Memory Network, which makes reading documents more viable by utilizing different encodings in the addressing and output stages of the memory read operation
Achieves SOTA on existing WikiQA benchmark
Details
QA research has been directed toward using Knowledge Bases (KBs), which has proven effective, but KBs suffer from being too restrictive (the schema cannot support certain types of answers) and too sparse.
Key-Value Memory Network is an extension of Memory Network.
Knowledge source is cumulatively added to context
the question is embedded as a query; the query takes an inner product with the keys (context), and the resulting softmax weights are applied to the values (content)
In KVMemNet, memory slots are key-value pairs of vectors (see the sketch below)
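A minimal sketch of a single memory read (shapes and names are mine; the paper's model iterates this over multiple hops with a query update in between):

```python
import torch
import torch.nn.functional as F

def kv_memory_read(query, keys, values):
    # query: (d,); keys, values: (n_slots, d)
    scores = keys @ query              # addressing: inner product with keys
    probs = F.softmax(scores, dim=0)   # attention over memory slots
    return probs @ values              # output: attention-weighted sum of values
```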
Personal Thoughts
QA tasks actively use attention to achieve better scores, but they are not fully applicable/related to NMT
NMT in a completely unsupervised manner, relying on nothing but monolingual corpora.
Earlier works include triangulation and semi-supervised learning, which still require a strong cross-lingual signal
Shared Encoder + Denoising Autoencoder + Language specific Decoder
Details
Motivation
A parallel corpus of good quality is difficult/expensive to acquire, whereas monolingual corpora are relatively easy to obtain.
Low-resource languages or unusual language pairs cannot afford a parallel corpus of sufficient quality and size to train an NMT model
Unsupervised NMT
fixed cross-lingual embedding is obtained via word2vec
The shared encoder encodes the meaning of the sentence; noise is added to the input for robust learning
L1 decoder (de-noising autoencoder) attempts to reproduce the input, learning the latent structure of the inputs
L2 decoder (language-specific decoder)
given an En input to the shared encoder, the L2 decoder for English can be learned (back-translation)
Training alternates the L1/L2 objectives from batch to batch; a sketch of the corruption step follows.
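A minimal sketch of one common corruption scheme for the denoising objective (word dropout plus local shuffling; the paper's exact noise model, e.g. swaps of contiguous words, may differ):

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
    # randomly drop words, then lightly shuffle within a small window
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]
```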
Result
Simple de-noising alone makes the model learn to copy instead of translate
adding back-translation (L2 decoder) significantly improves the performance
BPE helps to tackle unknown words, although still weak in named entities
With a small amount of parallel data (semi-supervised), performance can be improved further
Analysis
Quantitatively, BLEU is still much lower (by about 10 BLEU points) than SOTA supervised models, but translation in an unsupervised manner does work.
Qualitatively, UNMT goes beyond literal word-by-word substitution and correctly handles structural differences between languages, a good sign
Personal Thoughts
my idea on VAE + GAN model is very similar to unsupervised NMT
Interested in seeing the shared encoder work
Papago needs this; we can extend to languages with abundant monolingual data
Comprehensive Technical Overview and Empirical Results of NMT in Systran
12 languages, for 32 language pairs
Details
[Corpus] Utilize 3 corpora for each language pair
a baseline corpus (1 million sentences) for quick experiments (day-scale)
a medium corpus (2-5M) for real-scale system (week-scale)
a very large corpus with more than 10M segments
[Train Epoch]
In Junczys-Dowmunt et al. 2016, authors mention using corpus of 5M sentences and training of 1.2M batches each having 40 sentences – meaning basically that each sentence of the full corpus is presented 10 times to the training.
In Wu et al. 2016, authors mention 2M steps of 128 examples for English–French, for a corpus of 36M sentences, meaning about 7 iterations on the complete corpus.
In our framework, for this release, we systematically extended the training up to 18 epochs and for some languages up to 22 epochs.
[PlaceHolder]
In most language pairs, our strategy combines a vocabulary shortlist and a placeholder mechanism
named entity placeholders (number, name, currency, url etc)
[Vocab]
For enko and jaen, BPE was used to reduce the vocabulary size and also to deal with the rich morphology and spacing flexibility observed in Korean.
Empirical results on training NMT in a large-scale e-commerce setting at Booking.com
Covers optimization, training and evaluation
Details
Model Architecture
4-layer LSTM written in Lua
Use global attention
Use "case" embedding feature
0.3 residual
no batch size indicated
Handles named entities by pre-processing the input: NE tags are detected in both sentences, replaced with placeholders, and simply copied over via the attention map
Optimizer
1M En-De dataset
SGD vs Adam vs Adagrad vs Adadelta (learning rates 1.0, 0.0002, 0.1, and 1.0, respectively)
SGD performs best
Multi-GPU
Async vs Sync Multi-GPU
single GPU performs best ~ opposite of our in-house result
Corpus Size
1M, 2.5M, 5M, 7.5M, 10M corpus ran 90M iterations
10M performs best after all, with higher human evaluation scores that the BLEU score does not fully capture (the more data, the better)
Evaluation
Adequacy + Fluency metric
Personal Thoughts
Solid work and experiments on NMT
In-house data seems to be abundant and strong
good to see that they openly publish their results
Fully Unsupervised NMT using Monolingual Corpora only by FAIR
De-noising Auto-encoder + Language specific Decoder + Language Discriminator
Good paper from ICLR 2018
Enables better NMT for low-resource language pairs
Performance is still well below supervised NMT
Details
Key Idea
Build a common latent space between the two languages
Learn to translate by reconstructing in both domains according to two principles
(i) the model has to be able to reconstruct a sentence in a given language from a noisy version of it, as in standard de-noising auto-encoders
(ii) The model also learns to reconstruct any source sentence given a noisy translation of the same sentence in the target domain, and vice versa
Learning Objective
De-noising Auto-Encoder : Embed sentence into latent space with noise and reconstruct it back
- Cross-Domain : Minimizing loss for (Source in lang1 -> Latent Space -> Reconstructed Target in lang2 -> Back into Latent Space -> Reconstruct Source in lang1 )
- Adversarial : Discriminator tries to identify the language by seeing the embedding in latent space, the model tries to fool by mapping same semantic sentences into same latent space in language independent manner
- Final Objective Function
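Restated from the paper (coefficients from memory, treat the exact notation as an assumption): the final loss is a weighted sum of the three terms above,

$$\mathcal{L} = \lambda_{auto}\big[\mathcal{L}_{auto}(src) + \mathcal{L}_{auto}(tgt)\big] + \lambda_{cd}\big[\mathcal{L}_{cd}(src \to tgt) + \mathcal{L}_{cd}(tgt \to src)\big] + \lambda_{adv}\,\mathcal{L}_{adv}$$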
Training
The model is initialized with a word-by-word translation model learned in an unsupervised way
Encoder tries to map the source sentence with noise into shared latent space, and reconstruct as in de-noising auto-encoder.
Decoder learns to reconstruct the input from the latent space, given a language flag
Discriminator tries to identify the source language in an adversarial setting
Model Selection
The BLEU score of round-trip (two-way) translation is used as an unsupervised model-selection metric
it shows good correlation with the classic supervised BLEU
Results
Not sure the baselines were really meaningful
Unsupervised does learn something!
Monolingual vs Parallel Corpus
10M monolingual sentences perform roughly on par with 100K parallel sentence pairs
Ablation Study
dropping subset of training scheme to see which part is critical in learning
De-noising Auto-Encoder and Cross-Domain are both critical
Personal Thoughts
Great work of Unsupervised NMT
Better than Cho's paper because it is fully differentiable
StarSpace treats every feature as an embedding and a set of features (an entity) as a bag of features (also an embedding), and optimizes the similarity to the label, which is itself an embedding.
The use of a positive generator and a negative generator stabilizes learning.
Widely applicable; shows strong performance in text classification, embeddings, etc. (a loss sketch follows)
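A minimal sketch of the ranking loss I believe StarSpace optimizes (a margin hinge over sampled negatives; the function name and the cosine similarity choice are assumptions):

```python
import torch
import torch.nn.functional as F

def starspace_loss(query_emb, pos_emb, neg_embs, margin=0.1):
    # query_emb, pos_emb: (d,); neg_embs: (k, d) drawn by the negative generator
    pos = F.cosine_similarity(query_emb, pos_emb, dim=-1)
    neg = F.cosine_similarity(query_emb.unsqueeze(0), neg_embs, dim=-1)
    # hinge: push the positive similarity above each negative by a margin
    return torch.clamp(margin - pos + neg, min=0).mean()
```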
Personal Thoughts
Can better embeddings help the performance of translation?
The decoder in NMT is a fixed, hand-designed procedure; one can improve its performance by learning to decode
Trainable greedy decoder learns to manipulate the hidden state of a trained neural translation system with an arbitrary decoding objective (BLEU or perplexity)
trained by a novel variant of deterministic policy gradient, called critic-aware actor learning.
Details
Much of the research on neural machine translation has focused solely on improving the model architecture, not on decoding
Cho (2016) showed the limitation of greedy decoding by simply injecting unstructured noise into the hidden state of the neural machine translation system
Uses Deterministic Policy Gradient with Critic-Aware Actor Learning for stable learning algorithm
Personal Thoughts
must read Cho's paper on NPAD
must learn about reinforcement learning (actor-critic, policy gradient etc)
The mathematical formulation of the actor-critic loss and the model figures (Fig. 1) are difficult to understand
Propose Tacotron, an end-to-end text-to-speech seq-to-seq model with attention, trained on <text, audio> pairs
Frame-level generation, much faster than sample-level autoregressive models
Details
Modern TTS models are complex and modular
classic : text extraction, feature extraction, acoustic model, and vocoder
Wavenet : slow due to its sample-level autoregressive nature, also requires conditioning on linguistic features from an existing TTS frontend, hence not end-to-end
Introduces Adaptive Computation Time (ACT), an algorithm that allows RNN to learn how many computational steps to take between receiving an input and emitting an output
Experimental results on four synthetic problems (determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers) show that performance is dramatically improved by the use of ACT.
In character-level language modelling on the Hutter Prize Wikipedia dataset, ACT does not yield large gains in performance, but it provides insight into the structure of the data, with more computation allocated to harder-to-predict transitions such as spaces between words and ends of sentences.
Details
The approach pursued here is to augment the network output with a sigmoidal halting unit whose activation determines the probability that computation should continue.
RNN vs RNN with ACT
In short, it takes a dynamic number of intermediate state updates, gated by the sigmoidal halting unit.
A time (ponder) penalty is applied to discourage pondering when it is not necessary.
Exact formulation and understanding of components must be revisited..! (re-read) A minimal sketch of the halting loop follows.
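My reconstruction of the halting loop for a single input step, batch size 1 (the real model also flags the first ponder step in the input, omitted here):

```python
import torch
import torch.nn as nn

class ACTCell(nn.Module):
    """Wraps an RNN cell with Adaptive Computation Time: the cell is applied
    repeatedly until the accumulated halting probability reaches 1 - eps, and
    the emitted state is the probability-weighted mean of intermediate states."""
    def __init__(self, cell, hidden_size, eps=0.01, max_ponder=10):
        super().__init__()
        self.cell, self.eps, self.max_ponder = cell, eps, max_ponder
        self.halt = nn.Linear(hidden_size, 1)  # sigmoidal halting unit

    def forward(self, x, state):
        total, weighted = 0.0, 0.0
        for n in range(self.max_ponder):
            state = self.cell(x, state)
            h = torch.sigmoid(self.halt(state)).squeeze(-1)  # halting prob
            if total + h.item() >= 1 - self.eps or n == self.max_ponder - 1:
                p = 1.0 - total                  # remainder goes to the last step
                ponder_cost = (n + 1) + p        # N + remainder, added to the loss
                return weighted + p * state, ponder_cost
            total += h.item()
            weighted = weighted + h * state
```

Usage: wrap e.g. `nn.GRUCell(input_size, hidden_size)` and add `tau * ponder_cost` to the loss, where tau is the time penalty discussed above.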
Experiment on Parity Error
ACT does lower the sequence error rate; with a smaller time penalty, it ponders more
Experiment on Wikipedia Character Prediction
ACT with a time penalty has a negligible effect on performance, but with a lower time penalty the model ponders more, especially on spaces between words and ends of sentences.
Personal Thoughts
Beautiful visualizations
I'm not sure how ACT is utilizing attention well yet..
No direct link to usage in NMT
But Alex Graves is a must-read and must-understand researcher
Propose to pre-train the encoder and decoder of seq2seq model with the trained weights of two language models
An additional language modeling loss is used to regularize the model during fine-tuning
SOTA in WMT 2014 English->German
Details
Basic Methodology
Train LM_src, LM_tgt using monolingual data
Use the pre-trained LMs' embedding layers and first LSTM layers, plus the softmax layer in the decoder
Improvement
Add a monolingual LM loss during training to preserve the LMs' feature extractors (one can also freeze the pre-trained weights for a few epochs and train all the weights later)
Residual connection to mitigate initial noise corrupting the pre-trained weights
Multi-layer attention, extracting both low and high level contextual information
Result
Outperforms Back-translation
Ablation Study
Pre-training the decoder is better, because the decoder does the more difficult job of keeping the semantics and generating sentences with correct syntax
The gains from pre-training the different components greatly overlap
The LM objective serves as a strong regularizer; it assures fluency for sure
Personal Thoughts
Why only benchmark back-translation?
How can I train Language Model with internal monolingual data?
The authors did a good and careful job of not corrupting the pre-trained weights, knowing that LM weights provide good fluency features
Use monolingual corpora in low-resource NMT by adding a copied corpus to the training data
BLEU improves by around 1.2 in the Turkish->English and Romanian->English translation tasks
Details
Related Works
Back Translation by Sennrich et al. 2016: train a target->source NMT model to translate the target monolingual corpus into the source language, and combine the resulting parallel corpus with the original parallel corpus.
Multi-task systems by Johnson et al. 2016: combining multiple translation directions (French->English, German->English etc)
This paper proposes simply copying the target monolingual corpus to obtain a parallel target->target corpus, and using it as additional training data
Amount of Resource per language pair
Performance improves with Low-Resource pairs
Contrary to the assumption,
fluency does not improve when the copied corpus is added, as shown by language-model perplexity
the translation of pronouns, named entities, and rare words improves, as shown by pass-through accuracy
Amount of Monolingual Data
Even with 3:1 ratio of monolingual to parallel corpora, BLEU increases.
The copied monolingual corpus does not hurt learning even when its ratio is relatively high
Personal Thoughts
Lesson : Low-Resource NMT techniques work when parallel corpus below 1M
Simple and elegant method with an incremental result, but I am not confident that BLEU +1.2 is a significant improvement in quality
Propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold
Although much effort has gone into network architectures, learning algorithms, and novel applications, decoding is not well studied
Details
Recurrent models are the de facto standard for linguistic tasks (language models, machine translation, dialogue, question answering, etc.)
Noisy Parallel Approximate Decoding (NPAD)
a meta-algorithm that runs many chains of a noisy version of an inner decoding algorithm (greedy or beam search) in parallel
fully parallelizable; speed is almost equivalent to a single greedy search
a neighborhood in the hidden state space corresponds to a set of semantically similar configurations in the input space, regardless of whether those configurations are close to each other in the input space
- adds Gaussian noise to the hidden state used to compute the logits, starting with a high noise level and annealing it as decoding progresses
- among the M hypotheses, selects the argmax of log probability
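A minimal sketch of the resulting procedure (the `decode_greedy` interface and the `sigma0 / t` annealing schedule are my assumptions; the paper tunes its own schedule):

```python
import math
import torch

def npad(decode_greedy, model, src, n_chains=8, sigma0=0.5):
    """Run n_chains noisy greedy decodes (sequentially here for clarity; they
    are parallelizable) and return the hypothesis with the best log probability."""
    best, best_logp = None, -math.inf
    for _ in range(n_chains):
        # inject annealed Gaussian noise into the hidden state at step t
        noise_fn = lambda h, t: h + torch.randn_like(h) * (sigma0 / t)
        hyp, logp = decode_greedy(model, src, noise_fn)
        if logp > best_logp:
            best, best_logp = hyp, logp
    return best
```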
Why not sampling?
it is more efficient to sample in the hidden state space than in the output space
the hidden space 'fills in' the semantically similar neighbors, whereas the output space is sparser
Different from Diverse Decoding
Diverse decoding is applicable to beam search only, and is conditional on previous beams
NPAD is independent and parallelizable
Stochastic Sampling vs NPAD
stochastically sampling from the final softmax distribution does improve upon greedy/beam search, but NPAD with the right noise parameters outperforms it in both NLL and BLEU
Even though pruning methods reduce the number of parameters by 90%, the speed-up is less than expected on many hardware platforms due to indexing overhead, irregular memory access, and the inability to utilize the array data-path
Propose pruning weights in block format, train with group lasso regularization to encourage sparsity in the model
10x smaller parameters with ~10% loss of accuracy
Details
Block Prune
prune blocks of a matrix instead of individual weights: a block is zeroed if its maximum-magnitude weight is below a threshold
Pruning during Training
this method actively prunes the parameters during training
Hyperparameters for pruning are
Group Lasso Regularization
adds the L2 norm of each group of weights to the loss, driving entire groups toward zero (see the sketch below)
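A minimal sketch of both pieces (the block shape and threshold are illustrative; the paper's pruning schedules differ):

```python
import torch

def group_lasso_penalty(weight, block=(4, 4)):
    """Sum of L2 norms of non-overlapping blocks: gradients push whole blocks
    toward zero, preparing them to be pruned."""
    r, c = block
    h, w = weight.shape
    blocks = weight[: h - h % r, : w - w % c].reshape(h // r, r, w // c, c)
    return blocks.permute(0, 2, 1, 3).reshape(-1, r * c).norm(dim=1).sum()

def block_prune_mask(weight, block=(4, 4), threshold=0.1):
    """Keep a block only if its maximum-magnitude weight reaches the threshold."""
    r, c = block
    h, w = weight.shape
    blocks = weight.abs()[: h - h % r, : w - w % c].reshape(h // r, r, w // c, c)
    keep = blocks.amax(dim=(1, 3)) >= threshold          # (h//r, w//c)
    mask = keep.repeat_interleave(r, 0).repeat_interleave(c, 1)
    full = torch.zeros_like(weight, dtype=torch.bool)
    full[: mask.shape[0], : mask.shape[1]] = mask        # leftover edges stay pruned
    return full
```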
Experiments
Speech Recognition system with CNN, RNN and FC layers
< 5% loss of accuracy obtained when model parameter size is reduced by 1/3 ~ 1/4
BP (block pruning), GLP (group lasso regularization with block pruning)
Speed-up
Block pruning has a higher speed-up when the batch is big
Pruning Schedule
BP and GLP prune more aggressively than ordinary weight pruning
Performance over Prune ratio
sudden decrease in performance after 90% threshold
lower layers are pruned more than higher layers
Personal Thoughts
wanted to see pruning in NMT
batch=1 has speed-up of ~3, wonder how they implemented it.
if op is sparse, then do I have to code new inference nmt.py?
Parameter settings in experiments were quite odd..
not sure what the real message is; the hidden sizes and resulting parameter counts are just all over the place
Propose a novel architectural unit, “Squeeze-and-Excitation”(SE) block, that adaptively re-calibrates channel-wise features in CNN.
Ensemble of SENets won 1st place in ILSVRC 2017 classification task with top-5 error rate 2.251% (~25% improvement from last year)
Details
Squeeze unit
global average pooling per channel (describes each whole channel with a single number)
Excitation unit
two FC layers that reduce and then restore the dimension (reduction ratio of 16), with a ReLU in between and a sigmoid at the end; the output is multiplied channel-wise to scale the original input (see the sketch below)
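A minimal SE block sketch in PyTorch (my reconstruction from the description above):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool each channel to one number, pass through a
    bottleneck MLP (reduction ratio r), and rescale the input channel-wise."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                    # squeeze: (B, C)
        w = self.fc(s)                            # excitation: weights in (0, 1)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # re-calibrate channels
```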
Can easily be inserted into existing SOTA CNN architectures. (ResNet, ResNeXt, Inception etc)
only a fractional parameter increase (<10% of all parameters)
Figure 8 shows that lower layers have almost identical distributions across labels, upper layers have meaningfully different distributions across labels, and the last layer seems saturated.
Personal Thoughts
Very clever idea of tweaking channel-wise inter-dependency
Applicable to re-calibrate channels of encoder states before they go into attention
Great work by self-driving car start-up, Momenta.ai in Beijing
Propose to speed up the decoder by applying a more flexible beam search strategy whose candidate size may vary at each time step depending on the candidate score
10% speed up in beam_size=5 without loss of accuracy
Details
Standard beam search has disadvantages
Less adaptive, it expands candidates whose scores are much worse than the current best
It discards hypotheses if they are not within the best scoring candidates, even if the scores are close
Search Strategies
Relative Threshold Pruning
discard candidates whose score falls below a relative threshold with respect to the best candidate
Absolute Threshold Pruning
discard candidates whose score is more than an absolute margin below the best candidate
Relative Local Threshold Pruning
consider only the score of the last generated word in pruning
Max Candidates per Node
limit the number of candidates that share the same history (parent hypothesis) at each time step
Fan Out per Sentence
fan out : number of candidates we expand
Original BeamSearch has linear fan out
the proposed beam search adaptively reduces the fan out (a combined sketch of the pruning rules follows)
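A minimal sketch combining three of the rules for one decoding step (the threshold values are illustrative, not the tuned ones from the paper):

```python
import math

def prune_candidates(cands, rel=0.6, abs_margin=2.5, max_per_node=3):
    # cands: list of (logp, parent_id) pairs proposed at this time step
    best = max(lp for lp, _ in cands)
    kept, per_node = [], {}
    for lp, parent in sorted(cands, reverse=True):
        if lp < best + math.log(rel):       # relative threshold pruning
            continue
        if lp < best - abs_margin:          # absolute threshold pruning
            continue
        n = per_node.get(parent, 0)
        if n >= max_per_node:               # max candidates per node
            continue
        per_node[parent] = n + 1
        kept.append((lp, parent))
    return kept
```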
Results
With beam_size=5, a 10~13% speed improvement when using all the proposed methods
Personal Thoughts
Better decoding speed without hurting performance; this is what I've wanted!
Train source-to-target NMT (student) without parallel corpora available, guided by the existing pivot-to-target NMT (teacher) on a source-pivot parallel corpus
X : source, Y : target, Z : pivot
Details
Related Works
Triangulated pivot-based method (X->Z, Z->Y), which is exposed to the error-propagation issue
multilingual (shared encoder/decoder structure)
Teacher-Student Approach
based on the translation equivalence assumption (similar to sequence-level knowledge distillation)
Sentence-Level Teaching
Assumption : If a source sentence x is a translation of a pivot sentence z, then the probability of generating a target sentence y from x should be close to that from its counterpart z
Minimizing the KL divergence between the two distributions leads to good translation from X to Y
The teacher model's parameters are removed from the optimization; they are fixed
The training objective is to minimize the following:
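Restated from memory (treat the notation as an assumption): over source-pivot pairs, with the teacher $P(y \mid z; \theta_{zy})$ fixed,

$$\mathcal{J}_{sent}(\theta_{xy}) = \sum_{(x,z)} \mathrm{KL}\Big(P(y \mid z; \theta_{zy}) \,\Big\|\, P(y \mid x; \theta_{xy})\Big) \;\doteq\; -\sum_{(x,z)} \mathbb{E}_{y \sim P(y \mid z;\theta_{zy})}\big[\log P(y \mid x; \theta_{xy})\big]$$

where the expectation is approximated by sampling or beam search from the teacher.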
Word-Level Teaching
Similarly, the training objective is to minimize the following:
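Again from memory: the word-level objective matches the teacher's per-position distributions instead of whole sequences,

$$\mathcal{J}_{word}(\theta_{xy}) = -\sum_{(x,z)} \sum_{t} \sum_{y \in V} P(y \mid y_{<t}, z; \theta_{zy}) \log P(y \mid y_{<t}, x; \theta_{xy})$$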
Result
Word-sampling outperforms sentence-beam
Personal Thoughts
Does not have to be absolutely zero-resource; one could instead try learning with monolingual data only
Very similar to knowledge distillation: the distribution learned by one translation model is used to teach another, student model
Neural Turing Machines (NTMs) learn algorithms from examples: they are fully differentiable computers that use backpropagation to learn their own programming.
Propose Neural GPU, a type of convolutional gated recurrent unit that is highly parallel and efficient to train.
Neural GPU can be trained on short instances of an algorithmic task (addition and multiplication) and successfully generalizes to long instances.
Details
Uses the CGRU (Convolutional Gated Recurrent Unit), a GRU whose transformations are convolutions, as the main building block (see the CGRU sketch in the Extended Neural GPU note above).
Good performance on addition and multiplication, with good generalization to longer sequences.
Great effort in optimization process
Grid Search : 3^6 = 729 parameter combinations
Curriculum Learning : train on n-digit numbers only after reaching 90% performance on (n-1)-digit numbers
Gradient Noise : add Gaussian noise to gradient, multiplied by fraction of non-fully-correct output
Gate cutoff : cutoff for sigmoid function
Parameter Sharing Relaxation : let the recurrent steps use different (untied) parameters that gradually converge to a single shared set (relaxation was critical for fitting the training data)
Personal Thoughts
Great engineering and effort in optimization process
Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training, but these traditional approaches fail to explain why large neural networks generalize well in practice.
Experiments with SOTA CNN models on image classification with SGD, fitting random training data.
Details
Deep neural networks easily fit random labels.
Effective capacity of the model is big enough to fully memorize the randomized training data.
Inception V3 (with dropout and weight decay) fits perfectly to random training data - not truly generalizing.
[Summary] Both explicit and implicit regularizers, when properly tuned, can help to improve generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers are removed.
The l2 norm is not an absolute indicator: weights with a higher l2 norm can generalize better than weights with a lower l2 norm
SGD acts as an implicit regularizer
On small data sets, even Gaussian kernel methods can generalize well with no regularization.
Early stopping has potential of regularization - the effect is case-by-case.
Personal Thoughts
Understanding neural network is difficult, because all the theoretical assumptions do not hold in non-convex, data-dependent, .. environment.
Good models are models that generalize well; where is this good generalization really coming from?
Adds structure to the attention module: a linear-chain conditional random field and a graph-based parsing model
Experiments on tree transduction, neural machine translation, question answering, and natural language inference show better performance and improved behavior
Details
Attention is a function of key, value, and query, where the key holds the whole context, the query holds the context to be answered, and the value holds the relevant content.
In the authors' words, the attention mechanism is the expectation of an annotation function with respect to a latent variable whose distribution is parameterized as a function of the source and the query.
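In symbols (restated from the paper's framing), the context vector is

$$c = \mathbb{E}_{z \sim p(z \mid x, q)}\big[f(x, z)\big]$$

standard softmax attention is the special case where $z$ is a single categorical variable selecting one memory position.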
Segmentation Attention
using linear-chain CRF with pairwise edges, it adds pairwise structure
Syntactic Attention
using graph-based parsing model, it adds tree-like structure
End-to-End training
forward pass is simple
backprop is not fully optimized with off-the-shelf tools
training is about 5x slower than the simple attention mechanism; inference speed is almost the same
Neural Machine Translation
EnJa data from WAT, 500k sentences of length <= 50
character-level and word-level with vocab cut-off of 10
the result is not significant at the word level; a slight increase at the character level
Visualization of attention : shows richer, denser attention when structure is added
Personal Thoughts
Agree that enriching the attention mechanism is a good area of research
not sure EnJa from WAT was a good benchmark corpus; no significant improvement
too much information is compressed into the attention mechanism
even a single token holds distributed context/content from its surroundings
Propose a novel encoder-decoder-reconstructor framework for NMT, utilizing target-source information as additional feedback
Details
New training objective, adding a reconstruction error term with lambda = 1:
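As I recall the formulation (notation an assumption), the objective adds a reconstruction likelihood computed from the decoder hidden states $s$:

$$\mathcal{J} = \sum_{n} \Big[\log P(y^{(n)} \mid x^{(n)}; \theta) + \lambda \log R(x^{(n)} \mid s^{(n)}; \theta, \gamma)\Big], \quad \lambda = 1$$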
- Raises the issue of the sub-optimality of the likelihood objective in decoding, and empirically shows that performance improves even with a very large beam size (1,000); but I am not quite sure this is important, since in practice a beam size < 10 is enough.
Personal Thoughts
Surprised that BLEU improvement is small (Is BLEU +1.5 really significant?)
Use of target-source information in parallel corpus was impressive
Use influence function to trace a model's prediction back to its training data.
An approximation of the influence function, requiring only gradients and Hessian-vector products, provides valuable information
Useful in debugging models and detecting dataset errors
Details
Using influence functions, one can ask questions such as "What would the model parameters be if certain training data were missing/altered?" without re-training the whole model
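The key quantity (from the paper) approximates the effect of upweighting a training point $z$ on the test loss, using only gradients and Hessian-vector products:

$$\mathcal{I}_{up,loss}(z, z_{test}) = -\nabla_\theta L(z_{test}, \hat\theta)^\top H_{\hat\theta}^{-1} \nabla_\theta L(z, \hat\theta)$$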
Useful in detecting adversarial examples
Useful in fixing mislabeled examples by providing good candidate lists, but the boost is limited compared to simply listing the examples with the highest training loss
Personal Thoughts
Understanding neural network is difficult, because all the theoretical assumptions do not hold in non-convex, data-dependent, .. environment.
Good approximation methods are always powerful and applicable
Use attention distribution to evaluate the translation quality
Use attention-filtered synthetic data added to existing parallel corpus to improve NMT translation quality in BLEU
Details
Attention-based Metrics
Coverage Deviation Penalty
aims to penalize the sum of attentions per input token for straying too far from 1
Absentmindedness Penalty
the dispersion of attention is measured via the entropy of the predicted attention distribution; again, we want the penalty value to be 1.0 for the lowest entropy and to head towards 0.0 for higher entropies (formulas sketched below)
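As I recall (treat the exact normalization as an assumption), with $\alpha_{ji}$ the attention on input token $i$ at output step $j$, both penalties are non-positive and are exponentiated into $(0, 1]$, so 1.0 is best:

$$\mathrm{CDP} = -\frac{1}{J}\sum_{i}\log\Big(1 + \big(1 - \sum_{j}\alpha_{ji}\big)^2\Big) \qquad \mathrm{AP} = \frac{1}{J}\sum_{j}\sum_{i}\alpha_{ji}\log\alpha_{ji}$$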
Training NMT with additional data
back-translation is good
but attention-filtered synthetic data is also better
it helps especially in the morphologically rich -> morphologically poor language direction
Personal Thoughts
Making the data richer and smoother via back-translation, copied corpora, sequence-level knowledge distillation, and attention-filtered synthetic corpora is a very strong approach
Comprehensive explanation of WMT17 Tasks and their results
Machine Translation tasks : news, biomedical and multimodal
Evaluation tasks : metric and run-time estimation of MT quality
Automatic Post-Editing task
Neural MT training task
Bandit Learning task
Details
Main MT task (news)
eval set is 1,500 sentences per language direction, 3,000 sentences in total
Evaluation of Direct Assessment (DA) via crowd-sourcing was impressive
requires monolingual background
careful control items (replicas of the reference, real MT output, degraded MT output) to filter out unreliable crowd workers (gamers)
Pearson correlation of 0.97+ with professional researchers' results
Personal Thoughts
WMT is a good conference.
Mature decisions, such as continuing to use HTER as a metric for future reference despite its discrepancies with BLEU and human evaluation, and the careful experimentation and validation of the DA metric
Helsinki NMT ranked 1st in the WMT 2017 News Translation task for English-Finnish
Details
Arsenals
Layer Normalization : preliminary experiments showed no improvement
Variational Dropout : dropout in recurrent states
Context Gates : achieved better cross-entropy, but no improvement in BLEU or chrF3
Coverage Decoder : preliminary experiments showed no improvement
Ensemble : Proper ensemble is best, but Parameter Averaging also helps
Experiments
Choice of Segmentation Strategy
BPE in the decoder performs well; a character-level decoder scores high in chrF3
- Ensemble
- Proper ensemble is best, but parameter averaging helps
- In the dev set, they found many contractions (wouldn't, etc.) that were not present in the training set, so they de-tokenized them
Personal Thoughts
Lots of ideas, tested on a preliminary baseline, with the effective ones applied to the large-scale data
Language-specific tuning, such as dev-set tuning and exhaustive search over encoder/decoder segmentation strategies, led to first place in the English-Finnish task
Analyze the representations learned by neural MT models through part-of-speech and morphological tagging tasks
Parameters include : word-based vs. character-based representations, depth of the encoding layer, the identity of the target language, and encoder vs. decoder representations
Details
Train an NMT model, then use its hidden-layer activations to perform other linguistic tasks, in order to compare representation quality across the parameter settings.
Findings
character-level representations are better at learning morphology than word-level ones
Lower layers of the encoder are better at capturing word structure, while deeper networks improve translation quality, suggesting that higher layers focus more on word meaning.
Translating into a morphologically rich language is difficult; the opposite direction is a simpler task
the decoder learns very little about word structure; when attention is used, it learns even less
Personal Thoughts
Good methodology; trying to measure the amount of information in a representation by looking at performance on other tasks is a unique approach
Good experiments, visualizations and explanations to the phenomenon
A capsule network proposed by Geoffrey Hinton, using layer-wise parallel attention
Insight from attention in human vision, where irrelevant details are ignored via a sequence of fixation points
Activities of the neuron in an active capsule represent the various properties of a particular entity that is present in the image (position, thickness, size, orientation, deformation etc)
Details
Routing Algorithm
Existing CNNs simply max-pool a single scalar from a matrix of numbers to extract the most salient trait
Capsules pool information from the previous layer's capsules via a dynamic routing algorithm
A routing softmax determines the initial layer-L to layer-(L+1) connectivity -> the input to L+1 is computed as a weighted sum -> the input to L+1 is squashed so its length lies in the 0~1 range -> the routing logits from L to L+1 are updated by the agreement between L's predictions and the squashed output (similar to an attention mechanism); see the sketch below
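A minimal sketch of the routing loop (my reconstruction; shapes assumed):

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1):
    # shrink vector length into (0, 1) while keeping its direction
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1 + n2)) * s / (n2.sqrt() + 1e-8)

def dynamic_routing(u_hat, iters=3):
    # u_hat: (n_lower, n_upper, d) -- predictions from layer-L capsules
    b = torch.zeros(u_hat.shape[:2])              # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=1)                   # routing softmax over L+1 capsules
        s = (c.unsqueeze(-1) * u_hat).sum(0)      # weighted sum: (n_upper, d)
        v = squash(s)                             # squashed output capsules
        b = b + (u_hat * v.unsqueeze(0)).sum(-1)  # agreement updates the logits
    return v
```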
Architecture
Simple 3-layer CapsNet with routing connection between Primary caps and DigitCaps only
Result on MNIST
better than CNN
What CapsNet learns
each dimension in the DigitCaps does learn some property
Non-autoregressive NMT allows an order of magnitude lower latency during inference.
Through knowledge distillation, the use of input token fertilities as a latent variable, and policy-gradient fine-tuning, they come within 2.0 BLEU points of the autoregressive Transformer network used as the teacher
Details
Towards non-autoregressive decoding
the naive method is to predict each output token independently, but this does not yield good results due to the multimodality problem: the conditional independence assumption cannot handle the existence of multiple valid target sentences.
for example, 'thank you' in English can be translated into 'danke schon' or 'vielen dank', but 'danke dank' is not acceptable.
Non-Autoregressive Transformer (NAT)
Encoder stack
encoder stays unchanged from original Transformer
Decoder stack
Decoder inputs
copy source inputs using fertilities (see the sketch below)
A fertility is the number of times each input token is copied into the decoder inputs; it controls the "speed" at which the decoder translates and determines the length of the target sentence
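A minimal sketch of how fertilities turn source tokens into decoder inputs:

```python
def copy_by_fertility(src_tokens, fertilities):
    # each source token is repeated fertility-many times;
    # the total count fixes the target length
    out = []
    for tok, f in zip(src_tokens, fertilities):
        out.extend([tok] * f)
    return out

# e.g. copy_by_fertility(["thank", "you", "."], [1, 2, 1])
# -> ["thank", "you", "you", "."]
```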
Non-causal self-attention
since future tokens need not be masked, the decoder uses self-attention that masks each query position only from attending to itself
Positional attention
includes an additional positional attention module in each decoder layer, which adds a stronger positional signal and allows the decoder to perform local reordering
Fertility
a latent variable that models the nondeterminism in the translation process
sample z from a prior distribution and then condition on z to non-autoregressively generate a translation
with a maximum fertility of 50
The conditional probability of a target translation Y is:
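Restated from the paper (notation slightly simplified): marginalizing over fertility sequences $f$ whose sum equals the target length $T$,

$$P(Y \mid X; \theta) = \sum_{f \in \mathcal{F}} \Big(\prod_{t'=1}^{T'} p_F(f_{t'} \mid x_{1:T'}; \theta)\Big) \prod_{t=1}^{T} p\big(y_t \mid x\{f\}; \theta\big)$$

where $x\{f\}$ denotes the source tokens copied according to $f$.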
Translation Predictor and Decoding Process
searching for all combinations of fertility is intractable, so
choose argmax fertility for each input word
choose expected value for fertility
noisy parallel decoding (NPD): sample random fertility sequences, decode them in parallel, then choose the best-scoring output (re-ranked by the autoregressive teacher)
Training (read again)
Sequence-level Knowledge Distillation
use the teacher-generated target corpus for training; the original corpus is too noisy and nondeterministic
Fine-tuning stage with reverse KL divergence with teacher output distribution in a form of word-level knowledge distillation
this KD loss is more favorable towards highly peaked student output distributions than a standard cross-entropy error would be
Joint training (read again)
the sum of the original distillation loss and two fertility terms: one an expectation over the fertility distribution, normalized with a baseline and trained via policy gradient, the other based on an external fertility inference model and trained via backprop
Experiments
IWSLT16 En-De as development
WMT14 En-De and WMT16 En-Ro as final result verification
use shared BPE, for IWSLT use separate vocab and embedding
the student's encoder weights are initialized from its teacher's encoder weights
fertility prediction is supervised during training with alignments from fast_align (IBM Model 2)
Results
NAT 2~5 BLEU less than AT
Speed-up of more than 15x over beam search in the teacher model
Good experimental scheme
Fine-tuning does not converge with RL or BP alone; all three fine-tuning terms are needed to get +1.5 BLEU
Overall Structure
src_len vs Latency
learning curve for NAT (bleu on dev set)
Personal Thoughts
Contributions
Non-autoregressiveness
NPD is strong, let's implement
They tried many ideas, observed their poor performance, and tried others
the analysis of determinism, the approximating distribution, the policy gradient, and the roles of each module are still unknown to me
for naver_dic, try to reduce hyperparameters (model size, hidden size, layer, head, warmup step)
use their viz scheme (src_len vs latency, learning curve)
Experiments with three types of knowledge distillation: word-level, sequence-level, and sequence-level interpolation
Best student model runs 10 times faster than its sota teacher with little loss in performance
Network pruning can further reduce parameters with little loss in performance
Details
Overview of Three types of Knowledge Distillation
Word-level KD
the student model is trained on the word-level cross-entropy loss against the teacher model's per-token distributions
Sequence-level KD
student-model learned on the new training set that is a beam-search result of teacher model
beam-search result of teacher model is a tractable approximation of sequence-level knowledge
Sequence-level Interpolation KD
uses the original training data and the beam-search output of the teacher model simultaneously to balance the objective function (a word-level KD sketch follows)
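A minimal sketch of the word-level variant (shapes and the mixing weight are assumptions):

```python
import torch.nn.functional as F

def word_kd_loss(student_logits, teacher_logits, gold=None, alpha=0.5):
    # student_logits, teacher_logits: (n_tokens, vocab); gold: (n_tokens,)
    t = F.softmax(teacher_logits, dim=-1)
    kd = -(t * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
    if gold is None:
        return kd
    return alpha * kd + (1 - alpha) * F.cross_entropy(student_logits, gold)
```

Sequence-level KD needs no special loss: the student simply trains with ordinary cross-entropy on the teacher's beam-search outputs.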
Experiment on En-De
4M training data, 50k word-level
4 x 1,000 LSTM as teacher-model
2 x 300 LSTM and 2 x 500 LSTM as student-model
Result
Seq-KD on the 2 x 500 LSTM has better BLEU with greedy search and comparable BLEU with beam search; note that the greedy-search result is 16.9% compared to 1.3% in the baseline (a speed enhancement can be expected)
Word-KD, Seq-KD, Seq-Inter are complementary to each other
Speed
2 x 500 LSTM with greedy search (1051.3) is 10 times faster than 4 x 1,000 LSTM with beam-search
Pruning
Network pruning can be complementary, and further reduces param count with little loss of performance
Personal Thoughts
Applicable to Papago NMT
Faster decoding time with less storage (possibly on mobile phone)
Beam search is an approximate inference algorithm for the intractable inference problem in sequence models, but it often produces nearly identical sequences in the top beams
Propose Diverse Beam Search (DBS) that decodes a diverse output by optimizing for a diversity-augmented objective
First paper that I've read on OpenReview, which got rejected by a small margin. I should be reading open-reviews carefully.
Details
Beam Search
Maximum a Posteriori (MAP) inference for RNNs is the task of finding the most likely output sequence given the input. Since the number of possible sequences grows as |V|^T, exact inference is NP-hard so approximate inference algorithms like Beam Search (BS) are commonly employed. BS is a heuristic graph-search algorithm that maintains the B top-scoring partial sequences expanded in a greedy left-to-right fashion.
At each step, select top-k beams that have maximum cumulative log probability scaled by length penalty
Diversity in Beam Search
inherently, lacks the diversity in top-k beams
Diverse Beam Search
optimize an objective that consists of two terms – the sequence likelihood under the model and a dissimilarity term that encourages beams across groups to differ. This diversity-augmented model score is optimized in a doubly greedy manner – greedily optimizing along both time and groups.
Partition the B beams into G groups, where each group is given a dissimilarity term relative to the previous groups, so tokens similar to those chosen by previous groups are discouraged.
At each step, select the top-(B/G) beams in each group that maximize the sum of the cumulative log probability and the dissimilarity term, scaled by the length penalty (sketch below).
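A minimal sketch of the group-wise scoring with Hamming dissimilarity (function names are mine):

```python
def hamming_diversity(token, prev_group_tokens):
    # penalize a token once per earlier group that emitted it at this step
    return -sum(1 for t in prev_group_tokens if t == token)

def dbs_score(logp, token, prev_group_tokens, lam=0.5):
    # diversity-augmented objective: model score + lambda * dissimilarity
    return logp + lam * hamming_diversity(token, prev_group_tokens)
```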
Result
Evaluates with oracle BLEU, the best BLEU score among the top-k beams
This evaluation is not significant when NMT only requires top-1 beam to be the output during inference. This is where this paper had its weakness in convincing the reviewer.
Diversity score is higher with larger groups (obviously), and Hamming diversity works best for dissimilarity function
Personal Thoughts
Well-written paper, but sad to see that this much effort did not pass ICLR 2017
Cannot be used in Papago because what we need is a decoder that generates better translation on top-1, not a decoder that can give us diverse candidates
Perhaps useful for generating multiple target sentences from a teacher model during knowledge distillation. NMT's biggest problem is that a single answer is given for each source sentence that can have multiple valid translations, which leads to inadequate penalties that act as noise.
Let's implement it and generate one-source-multiple-target NMT corpus
SYSTRAN’s submission to the WMT 2017 shared news translation task for English-German
Back-translation and Hyper-specialization
uses OpenNMT
Details
WMT 2017 News Translation Task
Data 4.6M Parallel corpus
Training
Nvidia GTX 1080, minibatch size of 64
SGD (learning rate 0.1) with an annealing rate of 0.7
Back Translation
translating target-language monolingual data back into the source language and using the result as a parallel corpus
4.5M synthetically generated back-translated sentences are added to the original 4.5M after 13 epochs of training on the original 4.5M
it improves performance!
Data Selection via LM
Less data is used to fine-tune the model
data is chosen by two 3-gram LMs, one trained on a news corpus and one on a random sample; when the difference in cross-entropy is big, the sentence is treated as news-related and included in the fine-tuning corpus
Hyper-specialization
25K news related set tuned with learning rate 0.7
improves BLEU by +0.3~0.5
Personal Thoughts
Good to see Systran openly participating and contributing to WMT2017
The amount of data is a real strength when augmented via back-translation, distillation, and monolingual corpora!
Hyper-specialization is a competition-fit strategy for squeezing out performance; likely overfitting