papers's Issues

Deep Voice 3: 2000-Speaker Neural Text-to-Speech

Abstract

  • Propose Deep Voice 3, a fully convolutional, attention-based neural TTS system
  • Trains ~10x faster; trained on 800+ hours of audio from over 2,000 speakers
  • Contributes custom kernel implementations to speed up inference

Details

  • Intro

    • Main idea borrowed from NMT, CNN-based auto-regressive model
    • Attention is monotonic and should be local in TTS
    • The model can output several acoustic representations, which are later converted to speech via WORLD, Griffin-Lim or WaveNet vocoders
  • Model Architecture

    • Input : Text Features (characters, phonemes, stresses)
    • Output : Acoustic Features (mel-band spectrogram, linear-scale log magnitude spectrogram, vocoder features)
    • Encoder
      • Pre-net + ConvNet x N + Post-net
      • Adds speaker embedding everywhere
        screen shot 2017-11-15 at 9 26 47 am
        screen shot 2017-11-15 at 9 29 04 am
    • Attention Block
      • Speaker embedding is added to positional embedding
        screen shot 2017-11-15 at 9 29 29 am
    • Converter
      • Different loss function for different output
        screen shot 2017-11-15 at 9 30 38 am
  • Text Preprocessing

    • uppercase all characters, remove intermediate punctuation, end each sentence with punctuation
  • Result

    • Higher MOS with ~10x faster training
      screen shot 2017-11-15 at 9 31 11 am

Personal Thoughts

Link : https://arxiv.org/pdf/1710.07654.pdf
Authors : Ping et al. 2017

100-epoch ImageNet Training with AlexNet in 24 Minutes

Abstract

  • Large batch (32K) training with LARS algorithm enables 100-epoch ImageNet training with AlexNet in 24 minutes.
  • One hour for 90-epoch ResNet-50 with 512 Intel KNLs
  • Usually, 90-epoch ImageNet-1k training with ResNet-50 on an NVIDIA M40 GPU takes 14 days

Details

  • Data parallelism is dominant in large-scale training due to its stability
  • A large batch size means fewer updates within a fixed number of epochs.
    • existing solutions such as Linear Scaling (Krizhevsky 2014) and the Warmup Scheme (Goyal et al. 2017) were not effective
    • in this paper, the LARS algorithm (You et al. 2017) with a warm-up scheme achieves the same test accuracy with a shorter training time (a minimal LARS sketch follows this list)
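
A minimal sketch of a LARS-style update, which scales each layer's learning rate by the ratio of its weight norm to its gradient norm. This assumes plain SGD without momentum; layer names and hyperparameter values are illustrative only.

```python
# Layer-wise Adaptive Rate Scaling (LARS) sketch, no momentum for brevity.
import numpy as np

def lars_update(weights, grads, global_lr=0.01, trust_coef=0.001, weight_decay=0.0005):
    """Apply one LARS step in place on a dict of per-layer weight arrays."""
    for name, w in weights.items():
        g = grads[name]
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # local learning rate ~ ||w|| / (||grad|| + wd * ||w||)
        local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
        w -= global_lr * local_lr * (g + weight_decay * w)

# toy usage with random weights/gradients
weights = {"conv1": np.random.randn(64, 3, 3, 3), "fc": np.random.randn(1000, 4096)}
grads = {k: np.random.randn(*v.shape) for k, v in weights.items()}
lars_update(weights, grads)
```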

Personal Thoughts

  • Hardware is the key component of large parallel training
  • interested in LARS method

Link : https://arxiv.org/pdf/1709.05011.pdf
Authors : You et al. 2017

Can Active Memory Replace Attention?

Abstract

  • For the case of soft attention: a somewhat mixed result across tasks.
  • Active memory operates on all of the memory in parallel in a uniform way, bringing improvements in algorithmic tasks, image processing and generative modelling.
  • Does active memory perform well in machine translation? [YES]

Details

  • [Attention]

    • Only a small part of the memory changes at every step, or the memory remains constant.
    • An important limitation of the attention mechanism is that it can only focus on a single element of the memory due to the nature of the softmax.
      screen shot 2017-09-28 at 10 50 18 am
  • [Active Memory]

    • Any model where every part of the memory undergoes active change at every step.
      screen shot 2017-09-28 at 10 50 23 am
  • [NMT with Neural GPU]

    • parallel encoding and decoding
    • BLEU < 5
    • conditional dependence between outputs is not considered
      screen shot 2017-09-28 at 10 52 31 am
  • [NMT with Markovian Neural GPU]

    • parallel encoding and 1-step conditioned decoding
    • BLEU < 5
    • possibly, Markovian dependence of the outputs is too weak for this problem - a full recurrent dependence of the state is needed for good performance
      screen shot 2017-09-28 at 10 52 36 am
  • [NMT with Extended Neural GPU]

    • parallel encoding and sequential decoding
    • BLEU = 29.6 (WMT 14 En-Fr)
    • active memory decoder (d) holds recurrent state of decoding and output tape tensor (p) holds past decoded logits, going through CGRU^d.
      screen shot 2017-09-28 at 10 52 38 am
  • [CGRU]

    • convolutional operation followed by recurrent operation
    • stack of CGRUs expands the receptive field of the conv op
    • output tape tensor acts as external memory of decoded logits (a minimal CGRU step is sketched below)
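
A minimal sketch of a CGRU step as described above (convolutional gates followed by a GRU-style gated update), assuming a 1D state tensor; kernel size and shapes are illustrative and not the exact Extended Neural GPU configuration.

```python
# Convolutional GRU (CGRU) sketch over a (batch, channels, width) state.
import torch
import torch.nn as nn

class CGRU(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # convolutions replace the matrix multiplications of a plain GRU
        self.update = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.candidate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, s):
        u = torch.sigmoid(self.update(s))        # update gate
        r = torch.sigmoid(self.reset(s))         # reset gate
        c = torch.tanh(self.candidate(r * s))    # candidate state
        return u * s + (1 - u) * c               # gated state update

s = torch.randn(2, 16, 32)          # (batch, channels, width)
cgru = CGRU(channels=16)
print(cgru(s).shape)                # stacking CGRUs expands the receptive field
```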

Personal Thoughts

  • Same architecture, but encoder and decoder hidden states may be doing different things
    • encoder : embed semantic locally
    • decoder : track how much it has decoded, use tape tensor to hold information of what it has decoded
  • Will it work for languages with different sentence order?
  • What part of the translation problem can we treat as convolutional?
  • Is Transformer a combination of attention and active memory?

Link : https://arxiv.org/pdf/1610.08613.pdf
Authors : Lukas Kaiser (Google Brain) et al. 2017

Single-Queue Decoding for Neural Machine Translation

Abstract

  • Propose Single Queue Decoding, a more flexible decoding algorithm that can revisit hypotheses discarded at an earlier step
  • Design a penalty function to punish hypotheses that tend to produce outputs longer or shorter than expected

Details

  • Beam Search has disadvantages in that

    • the algorithm must give up some existing hypotheses due to the fixed beam size ~ an exploration-exploitation dilemma
  • Single Queue Decoding

    • Save all hypotheses in a single queue
    • At each step, extract hypotheses of arbitrary length according to a score function (a minimal sketch follows the results below)
      screen shot 2017-11-10 at 10 20 54 am
    • Score Function
      screen shot 2017-11-10 at 10 21 05 am
      screen shot 2017-11-10 at 10 21 10 am
      screen shot 2017-11-10 at 10 21 12 am
  • Results

    • Higher BLEU with a relatively small slowdown in decoding speed
      screen shot 2017-11-10 at 10 21 28 am

    • SQD shows higher NLL value than beam search with same beam size
      screen shot 2017-11-10 at 10 19 12 am
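
A minimal sketch of the single-queue idea: all partial hypotheses, of any length, sit in one priority queue ordered by a score function, so hypotheses discarded earlier can be revisited later. The toy step model and the simple length-normalised score are placeholders, not the paper's exact score and penalty functions.

```python
# Single-queue decoding sketch with a toy stand-in for an NMT decoder step.
import heapq, math, random

VOCAB, EOS = list(range(1, 10)), 0

def toy_step(prefix):
    """Stand-in decoder step: returns (token, log_prob) candidates."""
    rng = random.Random(hash(tuple(prefix)) % (2**31))
    return [(t, math.log(rng.random() + 1e-9)) for t in [EOS] + VOCAB]

def score(logprob, length):
    return logprob / max(length, 1)            # simple length normalisation

def single_queue_decode(max_pops=200, expand_per_pop=3):
    queue = [(0.0, 0.0, [])]                   # (-score, total_logprob, tokens)
    best = None                                # (score, tokens)
    for _ in range(max_pops):
        if not queue:
            break
        _, logprob, toks = heapq.heappop(queue)
        if toks and toks[-1] == EOS:           # finished hypothesis
            s = score(logprob, len(toks))
            if best is None or s > best[0]:
                best = (s, toks)
            continue
        for tok, lp in sorted(toy_step(toks), key=lambda x: -x[1])[:expand_per_pop]:
            new_lp, new_toks = logprob + lp, toks + [tok]
            heapq.heappush(queue, (-score(new_lp, len(new_toks)), new_lp, new_toks))
    return best

print(single_queue_decode())
```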

Personal Thoughts

  • Improvement in NLL is not significant
  • Searches over more spaces, hence longer inference time
  • Not sure this is a good decoding method

Link : https://arxiv.org/pdf/1707.01830.pdf
Authors : Shu et al. 2017

Dual Supervised Learning

Abstract

  • Utilize dual tasks that have intrinsic connections with each other due to the probabilistic correlation (En-Fr vs Fr-En translation, Speech Recognition vs Text to Speech, Image Classification vs Image Generation)
  • Propose dual supervised learning method that trains dual tasks simultaneously.
  • Improves performance of both tasks

Details

  • Conditional distributions of the primal and dual tasks should satisfy the following equality :

screen shot 2017-09-27 at 12 25 00 pm

  • Add a probabilistic duality term to the loss function, as specified below :

screen shot 2017-09-27 at 12 25 28 pm

  • lambda_xy and lambda_yx are hyperparameters; the best performance is obtained with lambda near 0.01, which suggests the effect of probabilistic duality is quite small (a sketch of the duality term follows).
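
A minimal sketch of the duality regularizer implied by the equality above, i.e. the squared gap between log P(x) + log P(y|x) and log P(y) + log P(x|y), added to each direction's NLL with a small lambda. The log-probability inputs are placeholders for marginal language-model and translation-model scores.

```python
# Dual supervised learning loss sketch with placeholder log-probabilities.
def duality_regularizer(log_px, log_py_given_x, log_py, log_px_given_y):
    gap = (log_px + log_py_given_x) - (log_py + log_px_given_y)
    return gap ** 2

def dsl_loss(nll_xy, nll_yx, log_px, log_py_given_x, log_py, log_px_given_y,
             lambda_xy=0.01, lambda_yx=0.01):
    reg = duality_regularizer(log_px, log_py_given_x, log_py, log_px_given_y)
    # each direction's NLL is augmented with the shared duality penalty
    return nll_xy + lambda_xy * reg, nll_yx + lambda_yx * reg

print(dsl_loss(nll_xy=2.3, nll_yx=2.1,
               log_px=-42.0, log_py_given_x=-10.5,
               log_py=-39.0, log_px_given_y=-12.8))
```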

Personal Thoughts

  • Utilizing the duality of the tasks is clever and practical in theory; it effectively gives access to more data.
  • The improvement, however, seems limited.

Link : https://arxiv.org/pdf/1707.00415.pdf
Authors : Yingce Xia (School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui, China) et al. 2017

Attention and Augmented Recurrent Neural Networks

Abstract

  • Augmenting RNN with Attention is a new trend.
  • A human with a piece of paper is, in some sense, much smarter than a human without.
  • Since vectors are the natural language of neural networks, the memory is an array of vectors

Details

  • Neural Turing Machine
    • RNN with external memory bank
    • reading and writing : instead of predicting a discrete location to read/write, the model reads/writes everywhere but learns attention weights over locations
  • Attentional Interfaces
    • Basic attention
  • Adaptive Computation Time
    • a way for RNNs to do different amounts of computation each step
  • Neural Programmers
    • learns to create programs in order to solve a task

Personal Thoughts

  • attention is the key to next generation neural network

Link : https://distill.pub/2016/augmented-rnns/
Authors : Olah et al. 2016

Key-Value Memory Networks for Directly Reading Documents

Abstract

  • Introduce the Key-Value Memory Network, which makes reading documents more viable by utilizing different encodings in the addressing and output stages of the memory read operation
  • Achieves SOTA on existing WikiQA benchmark

Details

  • QA task has been directed toward using Knowledge Bases (KBs), which has proven effective, but it suffers from being too restrictive, as the schema cannot support certain types of answers, and too sparse.

  • Key-Value Memory Network is an extension of Memory Network.

    • Knowledge source is cumulatively added to context
    • the question is embedded as a query, which takes an inner product with the keys (context); the resulting softmax weights are used to combine the values (content), as sketched below
    • In KVMemNet, memory slots are pairs of vectors

screen shot 2017-10-24 at 9 21 01 am
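
A minimal sketch of one key-value memory read as described above: softmax over query-key inner products, then a weighted sum of values. The embeddings are random placeholders; the real model learns them and repeats the read over several hops.

```python
# Key-value memory addressing and read sketch.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kv_memory_read(query, keys, values):
    """query: (d,), keys/values: (num_slots, d)"""
    addressing = softmax(keys @ query)          # relevance of each memory slot
    return addressing @ values                  # weighted sum of value embeddings

d, slots = 64, 10
query = np.random.randn(d)
keys, values = np.random.randn(slots, d), np.random.randn(slots, d)
print(kv_memory_read(query, keys, values).shape)   # (64,)
```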

Personal Thoughts

  • QA task actively uses attention to achieve better scores, but are not fully applicable/related to NMT

Link : https://arxiv.org/pdf/1606.03126.pdf
Authors : Miller et al. 2016

Refining Source Representations with Relation Networks for Neural Machine Translation

Abstract

  • Relation Network + NMT to refine the encoding representation of the source
  • Claims that old information is often forgotten, and that earlier words are not processed with respect to later words.

Details

  • Relation Network learns the relationship between source words
    • Q: how is it different from self-attention?
  • RN is composed of CNN + Graph Propagation + MLP Layer with LeakyReLUs
    screen shot 2017-10-12 at 11 19 21 am
  • In EN-CN, outperforms Transformer (Transformer is on par with NMT+ in En-Cn pair)
    • seems they used a word-level vocabulary, no BPE
  • Strong performance in long sentences (length 50+)
  • Visualization
    • word alignment table
    • actual source/ref/hyp results

Personal Thoughts

  • Enriching information passed to attention layer
  • Improvement in Decoding side should also be investigated
  • Re-read Relation Network and Related Works

Link : https://arxiv.org/pdf/1709.03980.pdf
Authors : Zhang et al. 2017

Unsupervised Neural Machine Translation

Abstract

  • NMT in completely unsupervised manner, relying on nothing but monolingual corpora.
  • Earlier works include triangulation and semi-supervised learning which still requires a strong cross-lingual signal
  • Shared Encoder + Denoising Autoencoder + Language specific Decoder

Details

  • Motivation

    • A parallel corpus of good quality is difficult/expensive to acquire, whereas monolingual corpora are relatively easy to obtain.
    • Low-resource languages or unusual language pairs often lack a parallel corpus of sufficient quality and size to train an NMT model
  • Unsupervised NMT

    • fixed cross-lingual embedding is obtained via word2vec
    • Shared encoder encodes the meaning of the sentence; noise is injected into the input for robust learning (a minimal noise-model sketch follows this block)
    • L1 decoder (de-noising autoencoder) attempts to reproduce the input, trying to learn the latent structure of the inputs
    • L2 decoder (Language-specific decoder)
      • given En input to shared encoder, L2 decoder for English can be learned.
    • Training takes place by alternating training objective L1/L2 from batch to batch.
      screen shot 2017-11-02 at 9 51 25 am
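
A minimal sketch of the kind of input corruption used for the denoising objective (random word drops and local swaps); the exact noise types and probabilities used in the paper may differ.

```python
# Noise model sketch for the denoising autoencoder objective.
import random

def add_noise(tokens, drop_prob=0.1, swap_prob=0.1):
    # randomly drop words
    noisy = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    noisy = noisy[:]
    # randomly swap adjacent words (local reordering)
    for i in range(len(noisy) - 1):
        if random.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

print(add_noise("the shared encoder sees a corrupted sentence".split()))
```
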
  • Result

    • Simple de-noising makes model learn to copy, instead of translate
    • adding back-translation (L2 decoder) significantly improves the performance
    • BPE helps to tackle unknown words, although still weak in named entities
    • With a small amount of parallel data (semi-supervised), performance can be improved further
      screen shot 2017-11-02 at 9 55 17 am
  • Analysis

    • Quantitatively, BLEU is still much lower (about BLEU 10 points) than SOTA supervised models, but translation in unsupervised manner does work.
    • Qualitatively, UNMT proves it goes beyond a literal word-by-word substitution and correctly translates structural differences between languages, a good sign

Personal Thoughts

  • my idea on VAE + GAN model is very similar to unsupervised NMT
  • Interested in seeing the shared encoder work
  • Papago needs this; we can extend to languages with abundant monolingual data

Link : https://arxiv.org/pdf/1710.11041.pdf
Authors : Artetxe et al. 2017

SYSTRAN’s Pure Neural Machine Translation Systems

Abstract

  • Comprehensive Technical Overview and Empirical Results of NMT in Systran
  • 12 languages, for 32 language pairs

Details

  • [Corpus] Utilize 3 corpora for each language pair
    • a baseline corpus (1 million sentences) for quick experiments (day-scale)
    • a medium corpus (2-5M) for real-scale system (week-scale)
    • a very large corpus with more than 10M segments
  • [Train Epoch]
    • In Junczys-Dowmunt et al. 2016, authors mention using corpus of 5M sentences and training of 1.2M batches each having 40 sentences – meaning basically that each sentence of the full corpus is presented 10 times to the training.
    • In Wu et al. 2016, authors mention 2M steps of 128 examples for English–French, for a corpus of 36M sentences, meaning about 7 iterations on the complete corpus.
    • In our framework, for this release, we systematically extended the training up to 18 epochs and for some languages up to 22 epochs.
  • [PlaceHolder]
    • In most language pairs, our strategy combines a vocabulary shortlist and a placeholder mechanism
    • named entity placeholders (number, name, currency, url etc)
  • [Vocab]
    • For enko and jaen, used BPE to reduce the vocabulary size but also to deal with rich morphology and spacing flexibility that can be observed in Korean.
  • [Improvements of NMT]
  • [Pre-training]
    • using pre-trained model learned with generic corpus, and re-learning with domain-specific corpus enhances domain adaptation with fast speed
  • [Efficient Model]
    • 60% of parameters pruned w/o hurting the performance, by See et al. 2016
    • Distillation by Kim and Rush, 2016
      • slightly higher accuracy results for a 70% reduction of the number of parameters and a 30% increase on decoding speed
  • [Problems to be solved]
    • Missing words or parts of sentence
    • Badly managing quotes
    • Adequacy << Fluency in NMT
    • Very long sentences
    • Short word or the title of a news article
    • Cleaning the corpus
    • Alignment

Personal Thoughts

  • Very good paper, comprehensive and empirical.
  • Systran has good technical culture
  • Papago team must also be experimenting lots of ideas
  • Let's implement lots of ideas from academia!!

Link : https://arxiv.org/pdf/1610.05540.pdf
Authors : Crego et al. 2016

Compression of Neural Machine Translation Models via Pruning

Abstract

  • Magnitude-based pruning in NMT (RNN)
  • LSTM/Attention based NMT with 200M params can be pruned by 40% with very little performance loss on WMT14 En-De
  • With retraining, the 80%-pruned model can surpass the original performance
  • Extends the magnitude-based pruning approach of Han et al. 2015 from CNNs to RNNs

Details

  • Outline the NMT parameter architecture as below
    screen shot 2017-10-19 at 11 41 20 am

  • [Pruning Schemes]

    • Class-blind : prune x% with smallest magnitude, regardless of weight class
    • Class-uniform : prune x% with smallest magnitude within each class
    • Class-distribution : prune weights with magnitude less than standard deviation of each class
  • The class-blind scheme outperforms the others (the three schemes are sketched below)
    screen shot 2017-10-19 at 11 44 00 am
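
A minimal sketch of the three pruning schemes on a dict of weight classes; the thresholds and the class-distribution multiplier are illustrative.

```python
# Magnitude-based pruning schemes: class-blind, class-uniform, class-distribution.
import numpy as np

def prune_class_blind(weights, fraction):
    allw = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    thr = np.quantile(allw, fraction)              # one global threshold
    return {k: np.where(np.abs(w) < thr, 0.0, w) for k, w in weights.items()}

def prune_class_uniform(weights, fraction):
    out = {}
    for k, w in weights.items():
        thr = np.quantile(np.abs(w), fraction)     # per-class threshold
        out[k] = np.where(np.abs(w) < thr, 0.0, w)
    return out

def prune_class_distribution(weights, c=1.0):
    # prune weights smaller than c * std of their own class
    return {k: np.where(np.abs(w) < c * w.std(), 0.0, w) for k, w in weights.items()}

weights = {"softmax": np.random.randn(100, 50), "attention": np.random.randn(50, 50)}
pruned = prune_class_blind(weights, fraction=0.4)
print(sum(int((v == 0).sum()) for v in pruned.values()))
```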

  • Important Parameters

    • Higher layers are more important than lower layers (this is opposite from CNN's phenomenon)
    • Attention and Softmax are crucial
    • FC params are important in lower vocab to embedding layer
    • Forget gate params are less important in lower layers
    • Diagonal values in recurrent params seem important
    • embeddings of the least common words are less important
  • Retraining

    • with retraining, performance improves, likely due to a regularization effect
    • when training is restarted from the sparse format, performance is slightly lower
      screen shot 2017-10-19 at 11 45 40 am
  • Generalizability

    • Pruning scheme shows similar results and phenomenon in smaller NMT model with different language pair
  • Future Work

    • This paper does single-step pruning and retraining. Multiple iterations may improve the performance.
    • Other pruning methods (Optimal Brain Damage and Optimal Brain Surgery) are not empirically investigated

Personal Thoughts

  • Great experimentation. Clever pruning schemes that show exactly which params are important
  • Applicable to Papago NMT system

Link : https://arxiv.org/pdf/1606.09274.pdf
Authors : See et al. 2016

Toward a full-scale neural machine translation in production: the Booking.com use case

Abstract

  • Empirical results on training NMT in large scale E-commerce setting by Booking.com
  • Covers optimization, training and evaluation

Details

  • Model Architecture

    • 4-layer LSTM written in Lua
    • Use global attention
    • Use "case" embedding feature
      screen shot 2017-11-03 at 9 46 29 am
    • 0.3 residual
    • no batch size indicated
    • Handles named entity by pre-processing the input, detecting NE-tag in both sentences and replacing it with placeholder and simply copying it via attention map
  • Optimizer

    • 1M En-De dataset
    • SGD vs Adam vs Adagrad vs Adadelta (1.0, 0.0002, 0.1, 1.0)
    • SGD performs best
      screen shot 2017-11-03 at 9 43 38 am
  • Multi-GPU

    • Async vs Sync Multi-GPU
    • single GPU performs best ~ opposite of our in-house result
      screen shot 2017-11-03 at 9 44 45 am
  • Corpus Size

    • 1M, 2.5M, 5M, 7.5M, 10M corpus ran 90M iterations
    • 10M performs best after all, with higher human evaluation scores that are not fully reflected in BLEU (the more data, the better)
      screen shot 2017-11-03 at 9 45 26 am
  • Evaluation

    • Adequacy + Fluency metric
      screen shot 2017-11-03 at 9 47 35 am

Personal Thoughts

  • Solid works and experiments on NMT
  • In-house data seems to be abundant and strong
  • good to see that they openly publish their results

Link : https://arxiv.org/pdf/1709.05820.pdf
Authors : Levin et al. 2017

Unsupervised Machine Translation using Monolingual Corpora Only

Abstract

  • Fully Unsupervised NMT using Monolingual Corpora only by FAIR
  • De-noising Auto-encoder + Language specific Decoder + Language Discriminator
  • A good ICLR 2018 submission
  • Enables better NMT for low-resource language pairs
  • Performance is still well below supervised NMT

screen shot 2017-11-05 at 9 35 39 pm

Details

  • Key Idea

    • Build a common latent space between the two languages
    • Learn to translate by reconstructing in both domains according to two principles
    • (i) the model has to be able to reconstruct a sentence in a given language from a noisy version of it, as in standard de-noising auto-encoders
    • (ii) The model also learns to reconstruct any source sentence given a noisy translation of the same sentence in the target domain, and vice versa
  • Learning Objective

    • De-noising Auto-Encoder : Embed sentence into latent space with noise and reconstruct it back

screen shot 2017-11-05 at 9 45 51 pm

    • Cross-Domain : Minimizing loss for (Source in lang1 -> Latent Space -> Reconstructed Target in lang2 -> Back into Latent Space -> Reconstruct Source in lang1)

screen shot 2017-11-05 at 9 45 56 pm

    • Adversarial : the discriminator tries to identify the language from the latent-space embedding, while the model tries to fool it by mapping semantically equivalent sentences to the same region of the latent space in a language-independent manner

screen shot 2017-11-05 at 9 46 04 pm

    • Final Objective Function

screen shot 2017-11-05 at 9 46 08 pm

  • Training
    • The model starts from an unsupervised word-by-word translation model
    • Encoder tries to map the source sentence with noise into shared latent space, and reconstruct as in de-noising auto-encoder.
    • Decoder learns to reconstruct the input from the latent space, given a language flag
    • Discriminator tries to identify the source language in an adversarial setting

screen shot 2017-11-05 at 9 41 02 pm

screen shot 2017-11-05 at 9 41 18 pm

  • Model Selection
    • BLEU score for two-way translation is used as an evaluation metric
    • shows good correlation with classic BLEU

screen shot 2017-11-05 at 9 41 30 pm

screen shot 2017-11-05 at 9 42 23 pm

  • Results
    • Not sure the baselines were really meaningful
    • Unsupervised does learn something!

screen shot 2017-11-05 at 9 42 44 pm

  • Monolingual vs Parallel Corpus
    • 10M monolingual sentences ≈ 100K parallel sentences

screen shot 2017-11-05 at 9 43 28 pm

  • Ablation Study
    • dropping subset of training scheme to see which part is critical in learning
    • De-noising Auto-Encoder and Cross-Domain are both critical

screen shot 2017-11-05 at 9 44 10 pm

Personal Thoughts

  • Great work of Unsupervised NMT
  • Better than Cho's paper because it is fully differentiable
  • Interesting to see the shared embedding concept!

Link : https://arxiv.org/pdf/1711.00043.pdf
Authors : Lample et al. 2017

Learning to Compute Word Embeddings On the Fly

Abstract

  • Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare (long tail)
  • Representations of rare (OOV) words are difficult to train and often too data-hungry
  • Propose a method for predicting embeddings of rare words on the fly from a small amount of auxiliary data

Details

  • Definition Embedding

    • Instead of using a simple OOV embedding, use a definition encoder that takes the dictionary definition of the rare word and injects the resulting embedding into the original encoder
      screen shot 2017-11-01 at 10 41 41 am
  • Experiments

    • baseline : simple RNN structure with OOV
    • GloVe : unsupervised GloVe pretrained embedding with 840M monolingual corpus
    • with Dictionary or Spelling
    • Still somewhat weaker than GloVe, but given the amount of corpus/pre-training GloVe requires, it is a good step forward
      screen shot 2017-11-01 at 10 43 21 am

Personal Thoughts

  • How is actual implementation done?
    • use same encoder while training?
    • use already pre-trained encoder for definition encoder?
  • How is the performance in NMT?

Link : https://arxiv.org/pdf/1706.00286.pdf
Authors : Bahdanau et al. 2017

Deeply Supervised Nets

Abstract

  • Introduce Companion Objective to the individual hidden layers, in addition to the overall objective of output layer.
  • Directly pursue feature discriminativeness at all hidden layers to enhance network performance.

Details

  • companion objective is added to overall objective with hyperparameter gamma and decay rate alpha.

screen shot 2017-09-27 at 2 26 02 pm

Personal Thoughts

  • I thought making hidden layers discriminative will greatly enhance performance, but the result is not as competitive as expected.
  • Optimizing the intermediate layers is an interesting area to investigate.

Link : https://arxiv.org/pdf/1409.5185.pdf
Authors : ChenYu Lee (UCSD) et al. 2014

StarSpace: Embed All The Things!

Abstract

  • a general-purpose neural embedding model
  • FAIR's SOTA embedding model

Details

  • StarSpace treats all features as embeddings, a set of features (an entity) as a bag of features (also an embedding), and optimizes with respect to similarity to the label, which is also an embedding.
  • Use of positive generator and negative generator stabilizes the learning mechanism.
  • widely applicable and shows strong performance in text classification, embedding etc

Personal Thoughts

  • can better embedding help performance of translation?

Link : https://arxiv.org/pdf/1709.03856.pdf
Authors : Wu et al. 2017

Weighted Transformer Network for Machine Translation

Abstract

  • Propose Weighted Transformer, a Transformer with modified attention layers, that performs better in BLEU score and converges 15~40% faster
  • Specifically, replace the multi-head attention by multiple self-attention branches

Details

  • In short, Transformer is Self-Attention + Positional Encoding + Multi-head Attention
  • Multi-head Attention + FFN
    • generate multiple heads from the same input Q, K, V, concatenate them, and apply a single linear transformation

screen shot 2017-10-29 at 9 06 32 pm

    • After each layer, a two-layer FFN is used

screen shot 2017-10-29 at 9 06 44 pm

  • Proposed Multi-branch Attention
    • κ can be interpreted as a learned concatenation weight and α as the learned addition weight
    • κ scales the contribution of the various branches before α is used to sum them in a weighted fashion.
    • Unlike the Transformer, which weighs all heads equally, the proposed mechanism allows for ascribing importance to different heads
    • it adds only 192 parameters to the existing 213M params of the big model (a minimal sketch of the weighted branch combination follows)

screen shot 2017-10-29 at 9 09 39 pm
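
A minimal sketch of combining attention branches with learned κ (scaling) and α (addition) weights instead of the plain Transformer's equal-weight concatenation. Branch internals are stubbed out with single linear layers; this is an assumption-laden illustration, not the paper's exact layer.

```python
# Weighted multi-branch combination sketch with learned kappa/alpha weights.
import torch
import torch.nn as nn

class MultiBranchCombine(nn.Module):
    def __init__(self, num_branches, d_model):
        super().__init__()
        self.kappa = nn.Parameter(torch.ones(num_branches))   # concatenation weights
        self.alpha = nn.Parameter(torch.ones(num_branches))   # addition weights
        self.ffn = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_branches)])

    def forward(self, branches):            # branches: (num_branches, batch, len, d)
        kappa = torch.softmax(self.kappa, dim=0)
        alpha = torch.softmax(self.alpha, dim=0)
        out = 0
        for i, b in enumerate(branches):
            out = out + alpha[i] * self.ffn[i](kappa[i] * b)   # weighted branch sum
        return out

branches = torch.randn(4, 2, 7, 32)         # 4 branches, batch 2, length 7, d 32
layer = MultiBranchCombine(num_branches=4, d_model=32)
print(layer(branches).shape)
```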

  • Weighted Transformer in Graph

screen shot 2017-10-29 at 9 10 51 pm

  • Training Details

    • label smoothing with 0.1, dropout with 0.1
    • Adam optimizer with same setting as Transformer, but larger learning rate for (κ,α) and freeze weight for (κ,α) for last 10K iterations
    • 25,000 token per batch
    • Training Time
      • Transformer (small : big) = 100K : 300K steps vs Weighted Transformer (small : big) = 60K : 250K steps
  • Result

screen shot 2017-10-29 at 9 18 54 pm

  • Parameter Search Result

screen shot 2017-10-29 at 9 19 14 pm

  • Regularization Effect shown via viz

screen shot 2017-10-29 at 9 18 20 pm

  • (κ,α) Trend during training
    • each branch gets different amount of weight
    • last 10K fixed

screen shot 2017-10-29 at 9 19 36 pm

  • Gating
    • replace the summation in Equation (7) by a gating structure that sums up the contributions of the top k branches with the highest probabilities
    • not sure how it is done.

Personal Thoughts

  • ICLR 2018 submission
    • ICLR review is good learning material.
  • Improvement in Performance is not of great margin
  • Adds new structure to an existing SOTA model, but without strong theoretical grounding, so the novelty is limited
  • Worth trying, when we already have good Transformer implementation

Link : https://openreview.net/forum?id=SkYMnLxRW&noteId=SkYMnLxRW
Authors : Anonymous

Trainable Greedy Decoding for Neural Machine Translation

Abstract

  • Decoding in NMT is a hand-crafted, rule-based procedure; one can improve its performance by learning to decode
  • Trainable greedy decoder learns to manipulate the hidden state of a trained neural translation system with an arbitrary decoding objective (BLEU or perplexity)
  • trained by a novel variant of deterministic policy gradient, called critic-aware actor learning.

Details

  • Much of the research on neural machine translation has focused solely on improving the model architecture, not on decoding
  • Cho (2016) showed the limitation of greedy decoding by simply injecting unstructured noise into the hidden state of the neural machine translation system
  • Uses Deterministic Policy Gradient with Critic-Aware Actor Learning for stable learning algorithm

Personal Thoughts

  • must read Cho's paper on NPAD
  • must learn about reinforcement learning (actor-critic, policy gradient etc)
    • mathematical formula for loss function of actor-critic model and model figures (Fig.1) are difficult to understand

Link : https://arxiv.org/pdf/1702.02429.pdf
Authors : Gu et al. 2017

Tacotron: Toward End-to-End Speech Synthesis

Abstract

  • Propose Tacotron, an end-to-end text-to-speech seq2seq model with attention, trained on <text, audio> pairs
  • Frame-level generation, much faster than sample-level autoregressive models

Details

  • Modern TTS models are complex and modular

    • classic : text extraction, feature extraction, acoustic model, and vocoder
    • Wavenet : slow due to its sample-level autoregressive nature, also requires conditioning on linguistic features from an existing TTS frontend, hence not end-to-end
    • Wang's Seq2Seq model : requires pre-trained hidden Markov model aligner
  • Model Architecture

    • Input : Character level text
    • Encoder : Pre-net > CBHG
    • Decoder : Pre-net > CBHG with frame_size=3, outputs spectrogram
    • Griffin-Lim reconstruction : outputs waveform

screen shot 2017-11-08 at 9 56 01 pm

  • CBHG
    • Lee et al. 2017
    • adds non-causal convolutions, batch normalization, residual connections, and stride-1 max pooling to the original architecture for better regularization

screen shot 2017-11-08 at 9 57 49 pm

  • Griffin-Lim (Post-processing)
    • Griffin-Lim algorithm (Griffin & Lim, 1984) synthesizes waveform from the predicted spectrogram
    • used Tensorflow implementation
    • it helps to improve performance by polishing the output via context data

screen shot 2017-11-08 at 10 03 17 pm

  • Hyper-parameters

screen shot 2017-11-08 at 10 01 52 pm

  • Data

    • Internal North American English dataset with 24.6 hours
  • Result

    • 5-scale mean opinion score evaluation

screen shot 2017-11-08 at 10 05 15 pm

Personal Thoughts

  • Is the data enough?
  • Is there no automatic evaluation metric?

Link : https://arxiv.org/pdf/1703.10135.pdf
Authors : Wang et al. 2017

Adaptive Computation Time for Recurrent Neural Networks

Abstract

  • Introduces Adaptive Computation Time (ACT), an algorithm that allows RNN to learn how many computational steps to take between receiving an input and emitting an output
  • Experimental results on four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers show that performance is dramatically improved by the use of ACT.
  • In character-level language modelling on the Hutter prize Wikipedia, ACT does not yield large gains in performance but it provides insight into the structure of the data, with more computation allocated to harder-to-predict transitions such as spaces between words and ends of sentences.

Details

  • The approach pursued here is to augment the network output with a sigmoidal halting unit whose activation determines the probability that computation should continue.

  • RNN vs RNN with ACT

    • In short, it uses intermediate states which are activated by sigmoidal halting unit dynamically.
    • Timing penalty is applied to minimize the amount of pondering when not necessary.
    • Exact formulation and understanding of components must be revisited..! (re-read)
      screen shot 2017-10-24 at 9 29 14 am
  • Experiment on Parity Error

    • ACT does have an impact on lowering sequence error rate, with less penalty, it ponders more
      screen shot 2017-10-24 at 9 31 08 am
      screen shot 2017-10-24 at 9 32 09 am
  • Experiment on Wikipedia Character Prediction

    • ACT with time penalty has negligible effect, but with lower time penalty, model ponders more and especially on spaces and eos.
      screen shot 2017-10-24 at 9 32 42 am

Personal Thoughts

  • Beautiful visualization
  • I'm not sure how ACT is utilizing attention well yet..
  • No direct link to usage in NMT
  • But Alex Graves is the must-read and must-understand researcher

Link : https://arxiv.org/pdf/1603.08983.pdf
Authors : Alex Graves et al 2017

Unsupervised Pre-training for Sequence to Sequence Learning

Abstract

  • Propose to pre-train the encoder and decoder of seq2seq model with the trained weights of two language models
  • An additional language modeling loss is used to regularize the model during fine-tuning
  • SOTA in WMT 2014 English->German

screen shot 2017-11-10 at 12 00 32 am

Details

  • Basic Methodology

    • Train LM_src, LM_tgt using monolingual data
    • Use LM-pre-trained model's Embedding Layer, 1st LSTM Layer and Softmax Layer in Decoder
  • Improvement

    • Add monolingual loss during training to preserve the feature extractor of LMs (one can freeze pre-trained weights for few epochs and train the whole weight later too)
    • Residual connection to mitigate initial noise corrupting the pre-trained weights
    • Multi-layer attention, extracting both low and high level contextual information

screen shot 2017-11-10 at 12 01 50 am

  • Result
    • Outperforms Back-translation

screen shot 2017-11-10 at 12 03 41 am

  • Ablation Study
    • Pre-training decoder is better, because decoder does more difficult job of keeping semantics and generating sentence with correct syntax
    • Gains of pre-training greatly overlap
    • LM objective serves as a big regularizer, it assures fluency for sure

screen shot 2017-11-10 at 12 04 11 am

Personal Thoughts

  • Why only benchmark back-translation?
  • How can I train Language Model with internal monolingual data?
  • Authors did good and careful job of not corrupting pre-trained weights, knowing that LM weights provide good fluency features

Link : https://arxiv.org/pdf/1611.02683.pdf
Authors : Ramachandran et al. 2017

Copied Monolingual Data Improves Low-Resource Neural Machine Translation

Abstract

  • Using monolingual corpora in low-resource NMT by adding copied-corpus in training data
  • BLEU improves by ~1.2 in the Turkish > English and Romanian > English translation tasks

Details

  • Related Works

    • Back Translation by Sennrich et al. 2016: train target->source NMT to perform translation of target->source on target monolingual corpora, and resulting parallel corpora is combined with original parallel corpora.
    • Multi-task systems by Johnson et al. 2016: combining multiple translation directions (French->English, German->English etc)
    • This paper proposes using simple copy of target monolingual corpora to obtain parallel target->target corpora, and use it as training data
  • Amount of Resource per language pair
    screen shot 2017-10-24 at 9 42 59 am

  • Performance improves with Low-Resource pairs
    screen shot 2017-10-24 at 9 43 31 am

  • Contrary to the assumption,

    • fluency does not improve with the copied corpus added, as shown by language-model perplexity
    • the translation of pronouns, named entities and rare words improves, as shown by pass-through accuracy
      screen shot 2017-10-24 at 9 52 27 am
  • Amount of Monolingual Data

    • Even with 3:1 ratio of monolingual to parallel corpora, BLEU increases.
    • The copied monolingual corpus does not hurt learning even when its ratio is relatively high
      screen shot 2017-10-24 at 9 53 33 am

Personal Thoughts

  • Lesson : Low-Resource NMT techniques work when parallel corpus below 1M
  • Simple and Elegant method with incremental result, but not confident that BLEU +1.2 is a significant improvement in quality

Link : http://statmt.org/wmt17/pdf/WMT15.pdf
Authors : Currey et al 2017

CHRF: character n-gram F-score for automatic MT evaluation

Abstract

  • Propose the use of character n-gram F-score for automatic evaluation of machine translation output (language independent, tokenization independent)
  • The CHRF3 score showed the highest segment-level correlations for translation from English

Details

  • ChrP (Precision) & ChrR (Recall)

screen shot 2017-10-15 at 10 43 27 pm

  • Preferred over BLEU in 65% of cases (a minimal chrF sketch follows)
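
A minimal sketch of a character n-gram F-score in the spirit of chrF3 (β = 3); averaging and smoothing details are simplified relative to the official metric.

```python
# Character n-gram F-score (chrF-style) sketch.
from collections import Counter

def char_ngrams(text, n):
    s = text.replace(" ", "")            # simple whitespace handling
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        if hyp and ref:
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("the cat sat on the mat", "the cat is on the mat"))
```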

Personal Thoughts

screen shot 2017-10-15 at 10 40 04 pm

Link : http://www.statmt.org/wmt15/pdf/WMT49.pdf
Authors : Popovic et al. 2015

Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model

Abstract

  • Propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold
  • Although there has been a lot of effort on network architectures, learning algorithms and novel applications, decoding is not well studied

Details

  • Recurrent models are the de facto standard for linguistic tasks (language models, machine translation, dialogue, question answering etc)

  • Noisy Parallel Approximate Decoding (NPAD)

    • a meta-algorithm that runs in parallel many chains of a noisy version of an inner decoding algorithm (greedy or beam search)
    • fully parallelizable, speed is almost equivalent to single greedy search
    • a neighborhood in the hidden state space corresponds to a set of semantically similar configurations in the input space, regardless of whether those configurations are close to each other in the input space

screen shot 2017-10-30 at 11 16 44 pm

    • adds Gaussian noise in the calculation of the logits, starting with a high noise level and annealing it as decoding progresses
    • among the M hypotheses, selects the one with the highest log probability (a minimal sketch follows)
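
A minimal sketch of NPAD-style decoding: M independent greedy chains, each injecting annealed Gaussian noise into the hidden state, with the best hypothesis chosen by total log probability. `toy_decoder_step` stands in for a real conditional recurrent language model.

```python
# Noisy parallel approximate decoding sketch with a toy decoder step.
import numpy as np

VOCAB, EOS, HID = 20, 0, 8

def toy_decoder_step(h, token, rng):
    """Stand-in step function: returns (new_hidden, logits)."""
    h = np.tanh(h + 0.1 * token)
    logits = rng.standard_normal(VOCAB) + h.mean()
    return h, logits

def npad_decode(num_chains=8, sigma0=0.5, max_len=20, seed=0):
    rng = np.random.default_rng(seed)
    best = (-np.inf, None)
    for _ in range(num_chains):                 # chains are independent -> parallelizable
        h, token, logprob, out = np.zeros(HID), 1, 0.0, []
        for t in range(max_len):
            noise = rng.normal(0.0, sigma0 / (t + 1), size=HID)   # annealed noise
            h, logits = toy_decoder_step(h + noise, token, rng)
            probs = np.exp(logits - logits.max()); probs /= probs.sum()
            token = int(probs.argmax())         # greedy inner decoding
            logprob += float(np.log(probs[token]))
            out.append(token)
            if token == EOS:
                break
        if logprob > best[0]:
            best = (logprob, out)
    return best

print(npad_decode())
```
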
  • Why not sampling?

    • more efficient to sample in the hidden state space than in the output state space
    • hidden space 'fills in' the semantically similar neighbors, whereas state space is more sparse
  • Different from Diverse Decoding

    • Diverse decoding is applicable to beam search only, and is conditional on previous beams
    • NPAD is independent and parallelizable
  • Stochastic Sampling vs NPAD

    • stochastically sampling from the final softmax distribution does improve upon greedy/beam search, but NPAD with the right parameters outperforms both in NLL and BLEU

screen shot 2017-10-30 at 11 27 07 pm

  • Number of Parallel Chains
    • obviously, larger parallel chains improve performance

screen shot 2017-10-30 at 11 28 12 pm

  • NPAD with beam search
    • further improvement, although marginal, when combined with beam search

screen shot 2017-10-30 at 11 29 28 pm

  • NPAD vs Diverse Decoding
    • NPAD with beam search is marginally better than Diverse Decoding, almost identical to me

screen shot 2017-10-30 at 11 30 36 pm

Personal Thoughts

  • Fast, applicable decoding mechanism
  • Interesting idea to inject noise (annealed by time-step) during training as well

Link : https://arxiv.org/pdf/1605.03835.pdf
Authors : Cho et al. 2016

Block Sparse RNN

Abstract

  • Even though pruning methods reduce number of parameters by 90%, the speed-up is less than expected on different hardware platforms due to indexing overhead, irregular memory access and inability to utilize array data-path
  • Propose pruning weights in block format, train with group lasso regularization to encourage sparsity in the model
  • 10x smaller parameters with ~10% loss of accuracy

Details

  • Block Prune
    • prune blocks of a matrix instead of individual weights: a block is pruned if its maximum magnitude is below a threshold (sketched below)

screen shot 2017-10-30 at 9 20 03 pm
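
A minimal sketch of block pruning as described above: tile the weight matrix into blocks and zero any block whose maximum absolute value falls below a threshold. Block size and threshold are illustrative.

```python
# Block pruning sketch over a 2D weight matrix.
import numpy as np

def block_prune(w, block=(4, 4), threshold=0.1):
    out = w.copy()
    rows, cols = w.shape
    for i in range(0, rows, block[0]):
        for j in range(0, cols, block[1]):
            blk = out[i:i + block[0], j:j + block[1]]
            if np.abs(blk).max() < threshold:      # prune the entire block
                blk[...] = 0.0
    return out

w = 0.2 * np.random.randn(16, 16)
pruned = block_prune(w)
print("sparsity:", float((pruned == 0).mean()))
```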

  • Pruning during Training
    • this method actively prunes the parameters during training
    • Hyperparameters for pruning are

screen shot 2017-10-30 at 9 21 06 pm

  • Group Lasso Regularization
    • L2 loss of group of weights

screen shot 2017-10-30 at 9 21 51 pm

  • Experiments
    • Speech Recognition system with CNN, RNN and FC layers
    • < 5% loss of accuracy obtained when model parameter size is reduced by 1/3 ~ 1/4
    • BP (block pruning), GLP (group lasso regularization with block pruning)

screen shot 2017-10-30 at 9 22 19 pm

  • Speed-up
    • Block Pruning has higher speed-up when batch is big

screen shot 2017-10-30 at 9 24 22 pm

  • Pruning Schedule
    • BP and GLP prune more aggressively than ordinary weight pruning

screen shot 2017-10-30 at 9 25 10 pm

  • Performance over Prune ratio
    • sudden decrease in performance after 90% threshold
    • lower layers are pruned more than higher layers

screen shot 2017-10-30 at 9 25 47 pm

Personal Thoughts

  • wanted to see pruning in NMT
  • batch=1 has speed-up of ~3, wonder how they implemented it.
    • if op is sparse, then do I have to code new inference nmt.py?
  • Parameter settings in experiments were quite odd..
    • not sure what the real message is, number of hidden sizes and resulting number of parameters are just all over the place
  • I think they will be rejected..

Link : open review @ ICLR 2018
Authors :

Squeeze-and-Excitation Networks

Abstract

  • Propose a novel architectural unit, “Squeeze-and-Excitation”(SE) block, that adaptively re-calibrates channel-wise features in CNN.
  • Ensemble of SENets won 1st place in ILSVRC 2017 classification task with top-5 error rate 2.251% (~25% improvement from last year)

Details

  • Squeeze unit
    • global average pooling per channel (describes each channel with a single number)
  • Excitation unit
    • two FC layers with dimension reduction and restoration (reduction ratio 16), with ReLU and sigmoid; the resulting per-channel weights are multiplied with the original input to rescale it (see the sketch below)

screen shot 2017-09-27 at 11 39 17 pm
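
A minimal sketch of an SE block for a (batch, C, H, W) tensor: squeeze by global average pooling, excite with two FC layers (reduction ratio 16), then rescale the input channels.

```python
# Squeeze-and-Excitation block sketch.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                        # x: (batch, C, H, W)
        s = x.mean(dim=(2, 3))                   # squeeze: one number per channel
        e = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))   # excitation weights
        return x * e[:, :, None, None]           # re-calibrate channels

x = torch.randn(2, 64, 8, 8)
print(SEBlock(64)(x).shape)
```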

  • Can easily be inserted into existing SOTA CNN architectures. (ResNet, ResNeXt, Inception etc)
  • only a small increase in parameters (<10% of the total)
  • Figure 8. shows that lower layers have almost identical distributions across labels, upper layers have meaningfully different distributions across labels, and last layer seems to be saturated.

Personal Thoughts

  • Very clever idea of tweaking channel-wise inter-dependency
  • Applicable to re-calibrate channels of encoder states before they go into attention
  • Great work by self-driving car start-up, Momenta.ai in Beijing

Link : https://arxiv.org/pdf/1709.01507.pdf
Authors : Jie Hu (Momenta) et al. 2017

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Abstract

  • Three stage compression pipeline reduces parameter size by 35x ~ 49x
  • Compressed network has 3x ~ 4x speed enhancement and 3x ~ 7x energy efficiency.
    screen shot 2017-10-18 at 11 39 52 am

Details

  • [Network Pruning] by Han et al. 2015

    • Learn connectivity via normal network training.
    • Prune the small-weight connections: all connections with weights below a threshold are removed from the network.
    • Retrain the network to learn the final weights for the remaining sparse connections.
    • Pruning reduced the number of parameters by 9× and 13× for the AlexNet and VGG-16 models
  • [Trained Quantization and Weight Sharing]

    • Use k-means clustering to identify the shared weights for each layer of a trained network
    • Weights are quantized into k bins and the shared centroids are updated via backpropagation
    • Linear centroid initialization is preferred over random/density-based initialization because it better preserves the few large weights (a quantization sketch follows)
      screen shot 2017-10-18 at 11 42 32 am
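
A minimal sketch of weight sharing via k-means with linear centroid initialisation; the backpropagation fine-tuning of the centroids is omitted, only the clustering/assignment step is shown.

```python
# Trained-quantization-style weight sharing sketch (clustering step only).
import numpy as np

def quantize_weights(w, k=16, iters=20):
    flat = w.ravel()
    # linear centroid initialisation between min and max weight
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            members = flat[assign == c]
            if len(members):
                centroids[c] = members.mean()
    assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[assign].reshape(w.shape), assign.reshape(w.shape), centroids

w = np.random.randn(64, 64)
w_q, codes, codebook = quantize_weights(w)
print("unique values after sharing:", len(np.unique(w_q)))
```
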
  • [Huffman Coding]

  • Experiments

    • Baseline experiment with MNIST using LeNet-300 (1070KB -> 27KB ~ 40x)
    • AlexNet, VGG-16 on ImageNet (35x ~ 49x compression)
    • Pruning and Quantization works well together
      screen shot 2017-10-18 at 11 48 08 am
    • Speed Enhancement
      screen shot 2017-10-18 at 11 48 36 am
    • SOTA among loss-less compression
      screen shot 2017-10-18 at 11 49 13 am

Personal Thoughts

  • Very well-written paper.
  • Applicable to NMT
  • Tested in CNN, will it work in RNN?
  • Systran also uses pruning only, since quantization with weight sharing is not supported in cuSPARSE etc

Link : https://arxiv.org/pdf/1510.00149.pdf
Authors : Han et al. 2016

Beam Search Strategies for Neural Machine Translation

Abstract

  • Propose to speed up the decoder by applying a more flexible beam search strategy whose candidate size may vary at each time step depending on the candidate score
  • 10% speed up in beam_size=5 without loss of accuracy

Details

  • Beam Search is disadvantageous

    • Less adaptive, it expands candidates whose scores are much worse than the current best
    • It discards hypotheses if they are not within the best scoring candidates, even if the scores are close
  • Search Strategies

    • Relative Threshold Pruning
      • relative threshold compared to the best candidate (the pruning variants are sketched after this list)
        screen shot 2017-11-10 at 10 26 50 am
    • Absolute Threshold Pruning
      • Discard candidates whose score falls more than an absolute margin below the best candidate
        screen shot 2017-11-10 at 10 26 55 am
    • Relative Local Threshold Pruning
      • Consider the score of last generated word only in pruning
        screen shot 2017-11-10 at 10 26 57 am
    • Max Candidates per Node
      • Fix the number of candidates with same history in each time step
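
A minimal sketch of the threshold-based pruning variants applied to one step's candidate list, where each candidate carries its (total probability, last-word probability); the rp/ap/rpl values are illustrative, not the paper's tuned settings.

```python
# Threshold-based candidate pruning sketches for beam search.
def relative_threshold(cands, rp=0.3):
    best = max(p for p, _ in cands)
    return [c for c in cands if c[0] >= rp * best]          # keep if close (ratio) to best

def absolute_threshold(cands, ap=0.4):
    best = max(p for p, _ in cands)
    return [c for c in cands if c[0] >= best - ap]          # keep if within an absolute margin

def relative_local_threshold(cands, rpl=0.3):
    best_last = max(lw for _, lw in cands)
    return [c for c in cands if c[1] >= rpl * best_last]    # judge only the last word's score

cands = [(0.50, 0.60), (0.20, 0.30), (0.02, 0.05)]
print(relative_threshold(cands), absolute_threshold(cands), relative_local_threshold(cands))
```
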
  • Fan Out per Sentence

    • fan out : number of candidates we expand
    • Original BeamSearch has linear fan out
    • Proposed BeamSearch adaptively reduces the number of fan outs
      screen shot 2017-11-10 at 10 30 02 am
  • Results

    • With beam_size=5, 10~13% speed improvement with using all the proposed methods
      screen shot 2017-11-10 at 10 31 17 am
      screen shot 2017-11-10 at 10 31 19 am

Personal Thoughts

  • Better decoding speed without hurting performance, this is what I've wanted!

Link : https://arxiv.org/pdf/1702.01806.pdf
Authors : Freitag et al. 2017

A Teacher-Student Framework for Zero-Resource Neural Machine Translation

Abstract

  • Train source-to-target NMT (student) without parallel corpora available, guided by the existing pivot-to-target NMT (teacher) on a source-pivot parallel corpus
  • X : source, Y : target, Z : pivot

Details

  • Related Works

    • Triangulated pivot-based method (X->Z, Z->Y), exposed to error propagation issue
    • multilingual (shared encoder/decoder structure)
  • Teacher-Student Approach

    • based on translation equivalence assumption (similar to knowledge distillation - sequence level)
      screen shot 2017-11-07 at 9 37 02 am
  • Sentence-Level Teaching

    • Assumption : If a source sentence x is a translation of a pivot sentence z, then the probability of generating a target sentence y from x should be close to that from its counterpart z
    • Minimizing KL divergence of two distribution leads to good translation from X to Y
      screen shot 2017-11-07 at 9 39 53 am
    • Teacher model parameters are excluded from the optimization; they are kept fixed
      screen shot 2017-11-07 at 9 40 02 am
    • Training objective is to minimize the below
      screen shot 2017-11-07 at 9 40 06 am
  • Word-Level Teaching

    • Similarly, training objective is to minimize the below
      screen shot 2017-11-07 at 9 41 31 am
  • Result

    • Word-sampling outperforms sentence-beam
      screen shot 2017-11-07 at 9 42 03 am

Personal Thoughts

  • We do not have to aim for absolutely zero resources; rather, try learning with monolingual data only
  • Very similar to knowledge distillation, using the distribution learnt in the translation model to learn another student model

Link : https://arxiv.org/pdf/1705.00753.pdf
Authors : Chen et al. 2017

Ensemble Distillation for Neural Machine Translation

Abstract

  • Empirical experiments on Knowledge Distillation using Ensemble, OracleBLEU, same-sized student model etc

Details

  • Teacher Networks

    • Ensemble Teacher Model : ensemble of 6 models with different random initialization as teacher
    • Oracle BLEU Teacher Model : among beams, choose the one with highest bleu
  • Data Filtering

    • Amongst translated training data, if TER is higher than 0.8, remove from both original training data and translated data
  • Experiment

    • WMT2016 German -> English task with 3.9M parallel sentences
    • 40k BPE
    • 1,000 GRU, 620 Word Embedding, batch 64, SGD with annealing learning rate
  • Results

    • Single Teacher : Training with all data in same-sized student model outperforms
      screen shot 2017-11-09 at 11 06 00 am
    • Ensemble Teacher : data filtering is better
      screen shot 2017-11-09 at 11 06 05 am
    • Oracle BLEU Teacher : not effective
      screen shot 2017-11-09 at 11 06 07 am
    • Ensemble Teacher with Small Student : always have loss of accuracy
      screen shot 2017-11-09 at 11 06 10 am

Personal Thoughts

  • Application to Papago
    • Try to get the best-performing teacher model via Ensemble or high beam-size, NPAD or whatever
    • Training same-sized student model for better performance is impressive
    • Training small student does have loss of accuracy, but ours is too large. We can do better

Link : https://arxiv.org/pdf/1702.01802.pdf
Authors : Freitag et al. 2017

Neural GPUs Learn Algorithms

Abstract

  • Neural Turing Machines (NTMs) are fully differentiable computers that use backpropagation to learn their own programming, i.e., they learn an algorithm from examples.
  • Propose Neural GPU, a type of convolutional gated recurrent unit that is highly parallel and efficient to train.
  • Neural GPU can be trained on short instances of an algorithmic task(addition and multiplication) and successfully generalize to long instances.

Details

  • Use CGRU (Convolutional Gated Recurrent Unit) that is similar to GRU with convolutional kernel added as main block.
  • Good performance on addition and multiplication, good generalization to longer sequences.
  • Great effort in optimization process
    • Grid Search : 3^6 = 729 parameters
    • Curriculum Learning : train n-digit number only after making 90% performance in (n-1)-digit number
    • Gradient Noise : add Gaussian noise to gradient, multiplied by fraction of non-fully-correct output
    • Gate cutoff : cutoff for sigmoid function
    • Parameter Sharing Relaxation : let hidden units of RNN be different and gradually converge to single parameter (Relaxation was critical in fitting the training data)

Personal Thoughts

  • Great engineering and effort in optimization process
  • Well-explained, thorough paper to read

Link : https://arxiv.org/pdf/1511.08228.pdf
Authors : Lukasz Kaiser, Ilya Sutskever (Google Brain) et al. 2016

Understanding Deep Learning Requires Rethinking Generalization

Abstract

  • We must rethink about generalization
  • Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training, but these traditional approaches fail to explain why large neural networks generalize well in practice.
  • Experiments with SOTA CNN models on image classification with SGD, fitting random training data.

Details

  • Deep neural networks easily fit random labels.
    • Effective capacity of the model is big enough to fully memorize the randomized training data.
    • Inception V3 (with dropout and weight decay) fits perfectly to random training data - not truly generalizing.
  • [Summary] Both explicit and implicit regularizer, when properly tuned, could help to improve the generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers removed.
    • l2 norm is not absolute : weights with higher l2 norm have better generalization than weights with lower l2 norm
    • SGD acts as an implicit regularizer
    • On small data sets, even Gaussian kernel methods can generalize well with no regularization.
    • Early stopping has potential of regularization - the effect is case-by-case.

Personal Thoughts

  • Understanding neural network is difficult, because all the theoretical assumptions do not hold in non-convex, data-dependent, .. environment.
  • Good models are models that generalize well, where is this good generalization really coming from?

Link : https://arxiv.org/pdf/1611.03530.pdf
Authors : Chiyuan Zhang (Google Brain) et al. 2017

Structured Attention Networks

Abstract

  • Adding structure to Attention module - a linear chain conditional random field and a graph-based parsing model
  • Experiments on tree transduction, neural machine translation, question answering and natural language inference shows better performance and improved behavior

Details

  • Attention is a function of key, value and query where key holds the whole context, query holds the context to be answered and value holds relevant contents.

  • In author's word, attention mechanism is the expectation of an annotation function with respect to a latent variable which is parameterized to be function of source and query.

  • Segmentation Attention

    • using linear-chain CRF with pairwise edges, it adds pairwise structure
      screen shot 2017-10-26 at 10 04 55 am
  • Syntactic Attention

    • using graph-based parsing model, it adds tree-like structure
  • End-to-End training

    • forward pass is simple
    • backprop is not fully optimized with off-the-shelf tools
    • training takes 5x slower than simple attention mechanism, inference is almost similar
  • Neural Machine Translation

    • EnJa data from WAT, 500k sentences, less50
    • character-level and word-level with vocab cut-off of 10
    • result is not significant in word-level, slight increase in character-level
      screen shot 2017-10-26 at 10 07 29 am
    • Visualization of attention : shows richer, denser attention when structure is added
      screen shot 2017-10-26 at 10 08 52 am

Personal Thoughts

  • Agree that enriching the attention mechanism is a good area of research
  • not sure EnJa from WAT was good benchmark corpus, no significant improvement
  • too much information is compressed into the attention mechanism
    • even a single token holds distributed context/content from its surroundings.

Link : https://arxiv.org/pdf/1702.00887.pdf
Authors : Kim et al 2017

Learning to Translate in Real-time with Neural Machine Translation

Abstract

  • Simultaneous NMT with binary RL decision maker
  • Evaluation metric for simultaneous NMT is a combination of quality (BLEU) and delay

Details

  • Uni-directional RNN as NMT Enc-Dec
  • RL controller deciding whether to READ or WRITE in policy gradient

screen shot 2017-10-14 at 2 02 12 am

  • Evaluation
    • Quality : BLEU
    • Delay : Average Proportion (by Cho et al. 2016), Consecutive Weight Length and Target Delay

Personal Thoughts

  • Extensive experimentation seems to be done
  • simultaneous MT problem is very different from ordinary MT

Link : http://www.aclweb.org/anthology/E17-1099
Authors : Gu et al. 2017

Neural Machine Translation with Reconstruction

Abstract

  • NMT systems often lack adequacy
  • Propose a novel encoder-decoder-reconstructor framework for NMT, utilizing target-source information as additional feedback

Details

  • New training objective, adding reconstruction error with lambda = 1

screen shot 2017-10-11 at 1 08 14 pm

  • Raises the issue of sub-optimality of the decoding likelihood objective, and empirically shows performance improves even with a very large beam size (1000); not quite sure this is important, since in practice a beam size < 10 is enough.

Personal Thoughts

  • Surprised that BLEU improvement is small (Is BLEU +1.5 really significant?)
  • Use of target-source information in parallel corpus was impressive
  • NMT is a difficult problem to operationalize

Link : https://arxiv.org/pdf/1611.01874.pdf
Authors : Tu et al. 2016

Understanding Black-box Predictions via Influence Functions

Abstract

  • Use influence function to trace a model's prediction back to its training data.
  • An approximation of the influence function that requires only gradients and Hessian-vector products provides valuable information
  • Useful in debugging models and detecting dataset errors

Details

  • Using influence function, one can ask questions such as "What is the model parameter like when certain training data was missing/altered?" without re-training the whole model
  • Useful in detecting adversarial examples
  • Useful in fixing mislabeled examples by providing good candidate lists, but limited boost compared to simple listing via highest training loss

Personal Thoughts

  • Understanding neural network is difficult, because all the theoretical assumptions do not hold in non-convex, data-dependent, .. environment.
  • Good approximation methods are always powerful and applicable

Link : https://arxiv.org/pdf/1703.04730.pdf
Authors : Pang Wei Koh(Stanford), Percy Liang(Stanford)

Confidence through Attention

Abstract

  • Use attention distribution to evaluate the translation quality
  • Use attention-filtered synthetic data added to existing parallel corpus to improve NMT translation quality in BLEU

Details

Attention-based Metrics

  • Coverage Deviation Penalty
    • aims to penalize the sum of attentions per input token for going too far from 1

screen shot 2017-10-28 at 12 19 42 am

  • Absentmindedness Penalty
    • dispersion of attention is measured via the entropy of the predicted attention distribution; the penalty should be 1.0 for the lowest entropy and head towards 0.0 for higher entropies (a toy sketch of both penalties follows the figure below)

screen shot 2017-10-28 at 12 19 46 am
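
A toy sketch of both penalties computed from a target-by-source attention matrix; the exact normalizations and the mapping from entropy to a 0~1 score are my own simplifications, not necessarily the paper's formulas.

```python
import numpy as np

def coverage_deviation_penalty(attn):
    """attn: (tgt_len, src_len) attention matrix whose rows sum to 1.
    Penalizes source tokens whose total received attention strays from 1."""
    coverage = attn.sum(axis=0)                      # attention mass per source token
    return -float(np.mean(np.log(1.0 + (1.0 - coverage) ** 2)))

def absentmindedness_penalty(attn, eps=1e-12):
    """Entropy of each target step's attention, mapped so that a perfectly
    peaked distribution scores ~1.0 and a flat one heads towards 0.0."""
    entropy = -(attn * np.log(attn + eps)).sum(axis=1)   # per target step
    return float(np.mean(np.exp(-entropy)))

# toy check: sharp (diagonal) vs. flat attention over 4 source tokens
sharp, flat = np.eye(4), np.full((4, 4), 0.25)
print(absentmindedness_penalty(sharp), absentmindedness_penalty(flat))  # ~1.0 vs 0.25
```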

  • Training NMT with additional data
    • back-translation is good
    • but attention-filtered synthetic data also helps
    • it helps especially in the morphologically rich -> poor language direction

screen shot 2017-10-28 at 12 24 27 am

Personal Thoughts

  • Enriching and smoothing the training data via back-translation, copied corpora, sequence-level knowledge distillation and attention-filtered synthetic corpora is a very strong approach

Link : https://arxiv.org/pdf/1710.03743.pdf
Authors : Rikters et al 2017

Findings of the 2017 Conference on Machine Translation (WMT17)

Abstract

  • Comprehensive explanation of WMT17 Tasks and their results
    • Machine Translation tasks : news, biomedical and multimodal
    • Evaluation tasks : metric and run-time estimation of MT quality
    • Automatic Post-Editing task
    • Neural MT training task
    • Bandit Learning task

Details

  • Main MT task (news)
    • eval set is 1,500 sentences per language direction, 3,000 sentences in total per language pair
  • Evaluation of Direct Assessment (DA) via crowd-sourcing was impressive
    • requires monolingual background
    • careful control items (replicas of the reference, real MT output, degraded MT output) to filter out unreliable crowd workers (gamers)
    • Pearson correlation of 0.97+ with professional researchers' results

Personal Thoughts

  • WMT is a good conference.
  • Decisions such as keeping HTER as a metric for future reference despite its discrepancy with BLEU and human evaluation, and the careful experimentation and validation of the DA metric, show maturity

Link : http://statmt.org/wmt17/pdf/WMT17.pdf
Authors : Bojar et al. 2017

The Helsinki Neural Machine Translation System

Abstract

  • Helsinki NMT ranked 1st in the WMT 2017 News Translation task for English-Finnish

Details

  • Arsenals

    • Layer Normalization : preliminary experiments showed no improvement
    • Variational Dropout : dropout in recurrent states
    • Context Gates : achieved better cross-entropy, but no improvement in BLEU or chrF3
    • Coverage Decoder : preliminary experiments showed no improvement
    • Ensemble : Proper ensemble is best, but Parameter Averaging also helps
  • Experiments

    • Choice of Segmentation Strategy
      • a BPE decoder performs well overall, while a character-level decoder scores higher in chrF3

screen shot 2017-11-05 at 9 29 09 pm

  • Ensemble : proper ensemble is best, but parameter averaging also helps

screen shot 2017-11-05 at 9 29 14 pm

  • In the dev set, they found many contractions (wouldn't, etc.) that were not present in the training set, so they de-tokenized them

Personal Thoughts

  • Lots of ideas were tested on a preliminary baseline, and the effective ones were applied to the large-scale data
  • Language-specific tuning, such as dev-set adaptation and exhaustive search over encoder/decoder segmentation strategies, led to first place in the English-Finnish task

Link : https://arxiv.org/pdf/1708.05942.pdf
Authors : Ostling et al. 2017

What do Neural Machine Translation Models Learn about Morphology?

Abstract

  • Analyze the representations learned by neural MT models through part-of-speech and morphological tagging tasks
  • Parameters include : word-based vs. character-based representations, depth of the encoding layer, the identity of the target language, and encoder vs. decoder representations

Details

  • Train an NMT model and use its hidden-layer representations to perform other linguistic tasks, in order to compare representation quality across the parameter settings above (a minimal sketch of this probing setup follows the findings below).

  • Findings

    • char-level are better at learning morphology than word-level
    • Lower layers of the encoder are better at capturing word structure, while deeper networks improve translation quality, suggesting that higher layers focus more on word meaning.
    • Morphologically rich -> poor translation is difficult, the opposite is simpler task
    • the decoder learns very little about word structure; when attention is used, it learns even less.
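
A minimal sketch of the probing methodology, assuming per-token encoder states can be extracted from a trained, frozen NMT model; `encode_tokens` is hypothetical, and the paper uses a small neural classifier rather than scikit-learn's logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode_tokens(sentence, layer=1):
    """Hypothetical helper: (n_tokens, hidden_dim) encoder states for
    `sentence`, taken from the given layer of a frozen NMT model."""
    return np.random.randn(len(sentence.split()), 512)   # placeholder

def probe_pos_accuracy(sentences, tag_sequences, layer=1):
    X = np.concatenate([encode_tokens(s, layer) for s in sentences])
    y = [tag for tags in tag_sequences for tag in tags]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)   # higher accuracy => representation encodes more morphology
```

In the paper the classifier is of course evaluated on held-out data; scoring on the training set here only keeps the sketch short.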

Personal Thoughts

  • Good methodology, trying to understand the amount of information in the representation via looking at performance in other task was unique
  • Good experiments, visualizations and explanations to the phenomenon
  • Still not sure NMT has become more transparent.
  • The role of the decoder is still a mystery
  • The role of depth is also a mystery

Link : https://arxiv.org/pdf/1704.03471.pdf
Authors : Belinkov et al 2017

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Abstract

  • Neural TTS model, advanced from Deep Voice 1
  • Can train Multi-speaker Embedding in single TTS model

Details

  • Complicated inference system in Neural TTS
    screen shot 2017-11-22 at 4 54 27 pm

  • Architecture for Multi-Speaker
    screen shot 2017-11-22 at 4 55 03 pm

  • Tacotron with Speaker conditioning
    screen shot 2017-11-22 at 4 55 27 pm

  • Result
    screen shot 2017-11-22 at 4 55 41 pm

Personal Thoughts

  • TTS is another mysterious, but very interesting area
  • I must revisit with good passion to implement them!

Link : https://arxiv.org/pdf/1705.08947.pdf
Authors : Arik et al. 2017

Dynamic Routing Between Capsules

Abstract

  • A capsule network proposed by Geoffrey Hinton, using layer-wise parallel attention
  • Insight from attention in human vision where irrelevant details are ignored via sequence of fixation points
  • Activities of the neuron in an active capsule represent the various properties of a particular entity that is present in the image (position, thickness, size, orientation, deformation etc)

Details

  • Routing Algorithm

    • Existing CNNs simply max-pool a single scalar from a matrix of numbers to keep the most salient activation
    • Capsules pools information from previous layer's capsules via dynamic routing algorithm
      • Routing softmax determines the initial connectivity from layer L to L+1 -> the input to each L+1 capsule is computed as a weighted sum of L's predictions -> that input is squashed to the 0~1 length range -> the routing logits from L to L+1 are updated by the agreement between L's predictions and the squashed output (similar to an attention mechanism); see the sketch after the figure below
        screen shot 2017-11-07 at 9 46 49 am
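
A numpy sketch of the routing-by-agreement loop, assuming `u_hat[i, j]` holds lower capsule i's prediction vector for upper capsule j (the trainable transformation matrices that produce `u_hat` are omitted).

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink a vector to length in [0, 1) while keeping its direction."""
    sq_norm = (s ** 2).sum(axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_lower, n_upper, dim) prediction vectors from layer L.
    Returns the layer L+1 capsule outputs, shape (n_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                         # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # routing softmax
        s = (c[..., None] * u_hat).sum(axis=0)               # weighted sum per upper capsule
        v = squash(s)                                        # squashed output vectors
        b += (u_hat * v[None, :, :]).sum(axis=-1)            # update by agreement
    return v
```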
  • Architecture

    • Simple 3-layer CapsNet with routing connection between Primary caps and DigitCaps only
      screen shot 2017-11-07 at 9 50 33 am
  • Result on MNIST

    • better than CNN
      screen shot 2017-11-07 at 9 51 23 am
  • What CapsNet learns

    • each dimension in DigitCaps do learn some properties
      screen shot 2017-11-07 at 9 51 45 am
  • MultiMNIST

    • learning overlapping digits
    • equivalent to SOTA, ~5% error rate on the test set
      screen shot 2017-11-07 at 9 52 11 am

Personal Thoughts

  • Ultimately, attention is how we improve neural networks
  • details on training are not shared in the paper
  • how can I apply this to NMT?

Link : https://arxiv.org/pdf/1710.09829.pdf
Authors : Sabour et al. 2017

Non-Autoregressive Neural Machine Translation

Abstract

  • Non-autoregressive NMT allows an order of magnitude lower latency during inference.
  • Through knowledge distillation, the use of input token fertilities as a latent variable, and policy-gradient fine-tuning, they come within 2.0 BLEU points of the autoregressive Transformer network used as a teacher

screen shot 2017-10-28 at 1 08 22 pm

Details

  • Towards non-autoregressive decoding
    • the naive method is to predict each output token independently, but this does not yield good results due to the multimodality problem: the conditional independence assumption cannot capture that several distinct target sentences are valid.
    • for example, 'thank you' in English can be translated into 'danke schon' or 'vielen dank', but 'danke dank' is not acceptable.

Non-Autoregressive Transformer (NAT)

  • Encoder stack

    • encoder stays unchanged from original Transformer
  • Decoder stack

    • Decoder inputs

      • copy source inputs using fertilities
      • A fertility is the number of times each input token is copied into the decoder inputs; it controls the "speed" at which the decoder translates and determines the target sentence length (a toy sketch of this copy step appears after this subsection)
    • Non-causal self-attention

      • since future tokens do not need to be masked, self-attention only masks each query position from attending to itself
      • Positional attention
        • include additional positional attention module in decoder layer, which adds stronger positional signal and allows decoder to perform local reordering
    • Fertility

      • a latent variable that models the nondeterminism in the translation process
      • sample z from a prior distribution and then condition on z to non-autoregressively generate a translation
      • with a maximum fertility of 50
    • Conditional Probability of a Target Translation, Y is
      screen shot 2017-10-28 at 1 20 12 pm
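
A toy sketch of the fertility-based copy that builds the decoder inputs (and thereby fixes the target length):

```python
def copy_by_fertility(src_tokens, fertilities):
    """Repeat each source token according to its predicted fertility;
    the summed fertilities determine the target sentence length."""
    dec_inputs = []
    for tok, f in zip(src_tokens, fertilities):
        dec_inputs.extend([tok] * f)
    return dec_inputs

print(copy_by_fertility(["thank", "you", "."], [1, 2, 1]))
# -> ['thank', 'you', 'you', '.']
```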

  • Translation Predictor and Decoding Process

    • searching for all combinations of fertility is intractable, so
      • choose argmax fertility for each input word
      • choose expected value for fertility
      • noisy parallel decoding (NPD): sample several fertility sequences, decode them all in parallel, then pick the best output by re-scoring with the autoregressive teacher
  • Training (read again)
    screen shot 2017-10-28 at 1 23 02 pm

    • Sequence-level Knowledge Distillation
      • use the teacher-generated target corpus for training; the original corpus is too noisy and nondeterministic
    • Fine-tuning stage with reverse KL divergence with teacher output distribution in a form of word-level knowledge distillation
      • KD favors highly peaked student output distributions more than a standard cross-entropy error would
        screen shot 2017-10-28 at 1 25 37 pm
    • Joint training (read again)
      • sum of the original distillation loss, an expectation over the fertility distribution normalized with a baseline (trained via policy gradient), and a term based on an external fertility inference model (trained via backprop)
        screen shot 2017-10-28 at 1 26 43 pm
  • Experiments

    • IWSLT16 En-De as development
    • WMT14 En-De and WMT16 En-Ro as final result verification
    • use shared BPE, for IWSLT use separate vocab and embedding
    • initialize the student's encoder weights from its teacher's encoder weights
    • fertility prediction is supervised during training with fertilities extracted by fast_align (IBM Model 2)
  • Results
    screen shot 2017-10-28 at 1 30 58 pm
    screen shot 2017-10-28 at 1 34 23 pm

  • NAT scores 2~5 BLEU points below AT

  • Speed-up of more than 15x over beam search in the teacher model
    screen shot 2017-10-28 at 1 37 41 pm

  • Good experimental scheme

  • Fine-tune does not converge with RL or BP alone, must use all three fine-tuning terms to get +1.5 BLEU

  • Overall Structure
    screen shot 2017-10-28 at 1 43 02 pm

  • src_len vs Latency
    screen shot 2017-10-28 at 1 43 04 pm

  • learning curve for NAT (bleu on dev set)
    screen shot 2017-10-28 at 1 43 09 pm

Personal Thoughts

  • Contributions
    • Non-autoregressiveness
  • NPD is strong, let's implement
  • They tried all ideas, saw their poor performance and tried another
  • the analysis of determinism, distribution approximation, policy gradient, and the role of each module is still unknown to me
  • for naver_dic, try to reduce hyperparameters (model size, hidden size, layer, head, warmup step)
  • use their viz scheme (src_len vs latency, learning curve)

Link :
Authors :

Learning to Remember Rare Events

Abstract

  • Present a large-scale life-long memory module for use in deep learning
  • Uses nearest-neighbor algorithm (LSH for approximate version) to extract from memory
  • Memory module can be easily added to any part of a supervised neural network

Details

  • The memory stores a layer embedding as key, a label as value, and an age per slot

screen shot 2017-10-17 at 7 18 24 pm

  • Loss term used in Memory module

screen shot 2017-10-17 at 7 20 30 pm

  • If no positive neighbor is found, update the memory by adding a new key-value pair
  • If a positive neighbor is found, update the stored key by averaging it with the query (see the sketch after the figure below)

screen shot 2017-10-17 at 7 18 44 pm
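
A minimal numpy sketch of the memory module under the update rules above; the paper's candidate set and randomized slot choice are simplified here to a plain nearest-neighbor lookup and oldest-slot replacement.

```python
import numpy as np

class LifelongMemory:
    """Keys are L2-normalized embeddings, values are labels, age tracks staleness."""

    def __init__(self, size, dim):
        self.keys = np.random.randn(size, dim)
        self.keys /= np.linalg.norm(self.keys, axis=1, keepdims=True)
        self.values = -np.ones(size, dtype=int)          # -1 = empty slot
        self.age = np.zeros(size, dtype=int)

    def query(self, q):
        q = q / np.linalg.norm(q)
        return int(np.argmax(self.keys @ q))             # nearest neighbor by cosine

    def update(self, q, label):
        q = q / np.linalg.norm(q)
        self.age += 1
        idx = self.query(q)
        if self.values[idx] == label:                    # positive neighbor found:
            new_key = q + self.keys[idx]                 # average query and key,
            self.keys[idx] = new_key / np.linalg.norm(new_key)  # then re-normalize
            self.age[idx] = 0
        else:                                            # otherwise write a new pair
            slot = int(np.argmax(self.age))              # reuse the stalest slot
            self.keys[slot], self.values[slot], self.age[slot] = q, label, 0
```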

  • Evaluated the life-long memory module on image classification, machine translation, and a synthetic task with the Extended Neural GPU, all with positive impact

Personal Thoughts

  • Lukasz - God - Kaiser paper
  • it is definitely must-implement paper

Link : https://arxiv.org/pdf/1703.03129.pdf
Authors : Kaiser et al. 2017

Sequence-Level Knowledge Distillation

Abstract

  • Knowledge distillation approach applied to NMT
  • Experiment three types of knowledge distillation : word-level, sequence-level, sequence-level interpolation
  • Best student model runs 10 times faster than its sota teacher with little loss in performance
  • Network pruning can further reduce parameters with little loss in performance

Details

  • Overview of Three types of Knowledge Distillation

screen shot 2017-10-22 at 5 23 52 pm

  • Word-level KD
    • the student model is trained on a word-level cross-entropy loss against the teacher model's output distribution (a sketch of this loss appears after the figures below)

screen shot 2017-10-22 at 5 24 55 pm

  • Sequence-level KD
    • the student model is trained on a new training set consisting of the teacher model's beam-search outputs
    • the teacher's beam-search output is a tractable approximation of sequence-level knowledge

screen shot 2017-10-22 at 5 25 41 pm

  • Sequence-level Interpolation KD
    • uses original training data and beam-search result of teacher-model simultaneously to balance the objective function

screen shot 2017-10-22 at 5 27 00 pm
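
A numpy sketch of the word-level KD loss; sequence-level KD needs no new loss, since it simply decodes the training set with the teacher's beam search and trains the student on those outputs with plain NLL. The `alpha` mixing weight is the usual KD interpolation, not a value from the paper.

```python
import numpy as np

def _softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def word_kd_loss(student_logits, teacher_logits, hard_targets=None, alpha=0.5):
    """logits: (tgt_len, vocab); hard_targets: (tgt_len,) gold token ids.
    Cross-entropy of the student against the teacher's per-position
    distribution, optionally interpolated with the usual NLL."""
    p_teacher = _softmax(teacher_logits)
    log_p_student = np.log(_softmax(student_logits) + 1e-12)
    kd = -(p_teacher * log_p_student).sum(axis=-1).mean()
    if hard_targets is None:
        return kd
    nll = -log_p_student[np.arange(len(hard_targets)), hard_targets].mean()
    return alpha * nll + (1.0 - alpha) * kd
```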

  • Experiment on En-De
    • 4M sentence pairs, 50k word-level vocabulary
    • 4 x 1,000 LSTM as teacher-model
    • 2 x 300 LSTM and 2 x 500 LSTM as student-model

screen shot 2017-10-22 at 5 28 02 pm

  • Result

    • Seq-KD on 2 x 500 LSTM has better BLEU on greedy-search, comparable BLEU on beam-search and note that greedy-search result is 16.9% compared to 1.3% in baseline (speed enhancement can be expected)
    • Word-KD, Seq-KD, Seq-Inter are complementary to each other
  • Speed

    • 2 x 500 LSTM with greedy search (1051.3) is 10 times faster than 4 x 1,000 LSTM with beam-search

screen shot 2017-10-22 at 5 31 53 pm

  • Pruning
    • Network pruning can be complementary, and further reduces param count with little loss of performance

screen shot 2017-10-22 at 5 31 56 pm

Personal Thoughts

  • Applicable to Papago NMT
    • Faster decoding time with less storage (possibly on mobile phone)

Link : https://arxiv.org/pdf/1606.07947.pdf
Authors : Kim et al. 2016

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Abstract

  • BeamSearch is an approximate inference model for intractable inference function in Sequence Models, but it often produces nearly identical sequences on top beams
  • Propose Diverse Beam Search (DBS) that decodes a diverse output by optimizing for a diversity-augmented objective
  • First paper that I've read on OpenReview, which got rejected by a small margin. I should be reading open-reviews carefully.

Details

  • Beam Search

    • Maximum a Posteriori (MAP) inference for RNNs is the task of finding the most likely output sequence given the input. Since the number of possible sequences grows as |V|^T, exact inference is NP-hard so approximate inference algorithms like Beam Search (BS) are commonly employed. BS is a heuristic graph-search algorithm that maintains the B top-scoring partial sequences expanded in a greedy left-to-right fashion.
    • At each step, select top-k beams that have maximum cumulative log probability scaled by length penalty
  • Diversity in Beam Search

    • BS inherently lacks diversity among the top-k beams

screen shot 2017-10-29 at 9 29 51 pm

  • Diverse Beam Search
    • optimize an objective that consists of two terms – the sequence likelihood under the model and a dissimilarity term that encourages beams across groups to differ. This diversity-augmented model score is optimized in a doubly greedy manner – greedily optimizing along both time and groups.
    • Partition B beams into G Groups where each group is given dissimilarity term related to previous groups, hence similar token as previous groups are discouraged.
    • At each step, select the top-(k/G) beams in each group that maximize the sum of cumulative log probability and the dissimilarity term, scaled by the length penalty (a sketch of this scoring step follows the figure below)

screen shot 2017-10-29 at 9 35 53 pm
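
A sketch of the diversity-augmented scoring for one beam in group g at one time step, with Hamming diversity (a flat penalty for every earlier group that already chose a token at this step); the full algorithm then runs ordinary beam search group by group with these scores.

```python
import numpy as np

def diverse_scores(log_probs, prev_group_tokens, lam=0.5):
    """log_probs: (vocab,) next-token log-probabilities for one beam.
    prev_group_tokens: token ids chosen at this step by groups 1..g-1.
    Returns the diversity-augmented scores used to expand this group."""
    penalty = np.bincount(prev_group_tokens, minlength=len(log_probs))
    return log_probs - lam * penalty

# toy usage: earlier groups both picked token 3, so token 3 is penalized twice
scores = diverse_scores(np.log(np.full(5, 0.2)), prev_group_tokens=[3, 3])
```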

  • Result
    • Evaluates on oracle BLEU, the best BLEU score among the top-k beams
    • This evaluation is not significant when NMT only requires top-1 beam to be the output during inference. This is where this paper had its weakness in convincing the reviewer.

screen shot 2017-10-29 at 9 38 15 pm

  • Diversity score is higher with larger groups (obviously), and Hamming diversity works best for dissimilarity function

screen shot 2017-10-29 at 9 38 37 pm

Personal Thoughts

  • Well-written paper, but sad to see that this much effort did not pass ICLR 2017
  • Cannot be used in Papago because what we need is a decoder that generates better translation on top-1, not a decoder that can give us diverse candidates
  • Perhaps useful for generating multiple target sentences from a teacher model during knowledge distillation. One of NMT's biggest problems is that a single reference is given for a source sentence that can have multiple valid translations, which leads to inadequate penalties acting as noise.
    • Let's implement it and generate one-source-multiple-target NMT corpus

Link : https://arxiv.org/pdf/1610.02424.pdf
Github Link : https://github.com/ashwinkalyan/dbs
Authors : Vijayakumar et al. 2016

SYSTRAN Purely Neural MT Engines for WMT2017

Abstract

  • SYSTRAN’s submission to the WMT 2017 shared news translation task for English-German
  • Back-translation and Hyper-specialization
  • uses OpenNMT

screen shot 2017-11-03 at 10 47 45 pm

Details

  • WMT 2017 News Translation Task
    • Data 4.6M Parallel corpus

screen shot 2017-11-03 at 10 48 10 pm

  • Training

    • Nvidia GTX 1080, minibatch size of 64
    • SGD (0.1) with annealing rate (0.7)
  • Back Translation

    • translating target language back into source language and using it as parallel corpus
    • 4.5M synthetically back-translated sentence pairs added to the original 4.5M after 13 epochs of training on the original data
    • it improves performance!

screen shot 2017-11-03 at 10 50 50 pm

screen shot 2017-11-03 at 10 50 55 pm

  • Data Selection via LM models

    • A smaller, in-domain subset is used to fine-tune the model
    • data is chosen using two 3-gram LMs, one trained on a news corpus and one on a random sample; when the cross-entropy difference is large, the sentence is treated as news-related and included in the fine-tuning corpus (a minimal sketch of this selection follows below)
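
A minimal sketch of the cross-entropy-difference selection (Moore-Lewis style), assuming `xent_news(s)` and `xent_general(s)` return the per-word cross-entropy of sentence s under the two 3-gram LMs; both helpers are hypothetical.

```python
def select_news_like(sentences, xent_news, xent_general, top_k):
    """Keep the top_k sentences that the news LM finds much easier than the
    general LM, i.e. the most news-like ones."""
    scored = [(xent_general(s) - xent_news(s), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # large gap first
    return [s for _, s in scored[:top_k]]
```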
  • Hyper-specialization

    • 25K news related set tuned with learning rate 0.7
    • improves BLEU by +0.3~0.5

Personal Thoughts

  • Good to see Systran openly participating and contributing to WMT2017
  • Amount of data is really strong, when generated via back-translation, distillation, monolingual!
  • Hyper-specialization is competition-fit strategy for squeezing the performance ~ likely overfitting

Link : https://arxiv.org/pdf/1709.03814.pdf
Authors : Deng et al. 2017

What does Attention in Neural Machine Translation Pay Attention to?

Abstract

  • Analyze how attention is similar or different from the traditional alignment (word)
  • Result : attention is different from alignment in some cases and is capturing useful information other than alignments (hence it's different)

Details

  • Using RWTH De-En data, measure attention-alignment accuracy
    screen shot 2017-10-27 at 9 40 21 am

  • Average attention loss/Average word prediction loss on POS tags of target side

    • Noun and Verbs are most common POS tags
    • Noun has lowest average attention loss, meaning attention is similar to alignment
    • Verb has 2x attention loss, meaning attention is quite different from alignment in verbs
      screen shot 2017-10-27 at 9 40 54 am
  • Correlation between WPE and Attention loss for input-feeding model

    • Low correlation for verbs confirms that attention to parts of the source sentence other than the aligned word is necessary for translating verbs
      screen shot 2017-10-27 at 9 42 20 am

Personal Thoughts

  • Of course, attention is not an alignment
  • Attention is a mix of context and meaning, it holds all the necessary components for the task in secret way
  • how is attention different in character-level/sub-word NMT?

Link : https://arxiv.org/pdf/1710.03348.pdf
Authors : Ghader et al 2017

Exploring Sparsity in Recurrent Neural Network

Abstract

  • Propose a pruning method during the training of the network
  • Time to train the model remains constant, and network size is reduced by 8x
  • Pruning a large dense network performs better than training a dense network with the same final parameter count

Details

  • Pruning Methodology

    • maintains a set of masks, a monotonically increasing threshold and a set of hyper parameters that are used to determine the threshold
    • this binary mask is multiplied with the weights at every step; all weights keep getting trained and updated while the mask changes dynamically (a minimal sketch of this loop follows the figures below)
    • Prune fully connected layers and recurrent states. Do not prune bias or normalization parameters.
    • Setting hyper parameters
      screen shot 2017-11-07 at 9 28 10 am
  • Pruning Algorithm
    screen shot 2017-11-07 at 9 28 33 am

    • Observe the types of layers and number of parameters for your network
      screen shot 2017-11-07 at 9 29 41 am
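
A minimal sketch of the masking loop, assuming a monotonically increasing magnitude threshold; the linear ramp here is a simplified stand-in for the paper's hyper-parameter schedule.

```python
import numpy as np

class GradualPruner:
    """Keep a binary mask per weight matrix and raise the pruning threshold
    as training proceeds; masked weights are zeroed in the forward/backward pass."""

    def __init__(self, shape, final_threshold, start_iter, end_iter):
        self.mask = np.ones(shape)
        self.final_threshold = final_threshold
        self.start_iter, self.end_iter = start_iter, end_iter

    def current_threshold(self, it):
        if it < self.start_iter:
            return 0.0
        frac = min(1.0, (it - self.start_iter) / (self.end_iter - self.start_iter))
        return frac * self.final_threshold            # monotonically increasing

    def apply(self, weights, it):
        eps = self.current_threshold(it)
        self.mask = (np.abs(weights) >= eps).astype(weights.dtype)  # recomputed as weights move
        return weights * self.mask
```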
  • Results

    • Sparse model starting with larger model performs on par with dense model with 8x reduction in param number
      screen shot 2017-11-07 at 9 30 14 am
    • Speed-up is not as large as expected, because efficient cuSPARSE support is not yet available
      screen shot 2017-11-07 at 9 31 00 am
  • Pruning percent

    • Lower layers are pruned more aggressively
    • (b) shows the number of pruned parameters increasing roughly exponentially over time, which performs better than a linear schedule
      screen shot 2017-11-07 at 9 31 47 am

Personal Thoughts

  • Good pruning method, easy to implement
  • Hope the speed-up is also as much as storage size improvement

Link : https://arxiv.org/pdf/1704.05119.pdf
Authors : Narang et al. 2017
