ATPapers

Worth-reading papers and related resources on attention mechanism, Transformer and pretrained language model (PLM) such as BERT.

Suggestions about fixing errors or adding papers, repositories and other resources are welcomed!


Attention

Papers

  • Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (ICML 2015) [paper] - Hard & Soft Attention
  • Effective Approaches to Attention-based Neural Machine Translation (EMNLP 2015) [paper] - Global & Local Attention
  • Neural Machine Translation by Jointly Learning to Align and Translate (ICLR 2015) [paper]
  • Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (EMNLP 2018) [paper]
  • Phrase-level Self-Attention Networks for Universal Sentence Encoding (EMNLP 2018) [paper]
  • Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling (ICLR 2018) [paper][code] - Bi-BloSAN
  • Leveraging Local and Global Patterns for Self-Attention Networks (ACL 2019) [paper] [tf code][pt code]
  • Attention over Heads: A Multi-Hop Attention for Neural Machine Translation (ACL 2019) [paper]
  • Are Sixteen Heads Really Better than One? (NeurIPS 2019) [paper]
  • Synthesizer: Rethinking Self-Attention in Transformer Models (CoRR 2020) [paper] - Synthesizer
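
Most of the papers above share the same core pattern: score a query against keys, softmax the scores into weights, and take the weighted sum of the values. Below is a minimal NumPy sketch of the (multiplicative, scaled dot-product) variant of that pattern; the shapes and names are illustrative and not taken from any listed codebase.

```python
# Minimal sketch of (scaled dot-product) attention; illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (seq_len, d_k) arrays; returns attended values and weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Toy usage: 4 positions, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```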

Survey & Review

  • An Attentive Survey of Attention Models (IJCAI 2019) [paper]

English Blog

Chinese Blog

Repositories

Transformer

Papers

  • Attention is All you Need (NIPS 2017) [paper][code] - Transformer
  • Weighted Transformer Network for Machine Translation (CoRR 2017) [paper][code]
  • Accelerating Neural Transformer via an Average Attention Network (ACL 2018) [paper][code] - AAN
  • Self-Attention with Relative Position Representations (NAACL 2018) [paper] [unofficial code]
  • Universal Transformers (ICLR 2019) [paper][code] - Universal Transformer
  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (ACL 2019) [paper] - Transformer-XL
  • Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned (ACL 2019) [paper]
  • Memory Transformer Networks (CS224n Winter 2019 Reports) [paper]
  • Star-Transformer (NAACL 2019) [paper]
  • On Layer Normalization in the Transformer Architecture (ICLR 2020) [paper]
  • Transformers without Tears: Improving the Normalization of Self-Attention (IWSLT 2019) [paper][code]
  • Reformer: The Efficient Transformer (ICLR 2020) [paper] [code 1][code 2][code 3] - Reformer
  • TENER: Adapting Transformer Encoder for Named Entity Recognition (CoRR 2019) [paper]
  • ReZero is All You Need: Fast Convergence at Large Depth (CoRR 2020) [paper] [code] [related Chinese post] - ReZero
  • Lite Transformer with Long-Short Range Attention (ICLR 2020) [paper][code] - Lite Transformer
  • HAT: Hardware-Aware Transformers for Efficient Natural Language Processing (ACL 2020) [paper][code] - HAT
  • Longformer: The Long-Document Transformer (CoRR 2020) [paper][code] - Longformer
  • Improving Transformer Models by Reordering their Sublayers (ACL 2020) [paper]
  • Highway Transformer: Self-Gating Enhanced Self-Attentive Networks (ACL 2020) [paper][code] - Highway Transformer
  • Talking-Heads Attention (CoRR 2020) [paper]
  • Linformer: Self-Attention with Linear Complexity (CoRR 2020) [paper] - Linformer
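
Several of the entries above (e.g. "On Layer Normalization in the Transformer Architecture", "Transformers without Tears") study where LayerNorm sits relative to the residual connections. As a reference point, here is a minimal PyTorch sketch of a pre-LN encoder block; the hyperparameters and module names are illustrative assumptions, not values from any listed paper.

```python
# Minimal pre-LN Transformer encoder block sketch (PyTorch); illustrative only.
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # LayerNorm is applied *before* each sublayer; residuals stay on the identity path.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.drop(attn_out)
        x = x + self.drop(self.ff(self.ln2(x)))
        return x

x = torch.randn(2, 10, 512)          # (batch, seq_len, d_model)
print(PreLNEncoderBlock()(x).shape)  # torch.Size([2, 10, 512])
```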

Chinese Blog

English Blog

Repositories

Pretrained Language Model

Models

  • Deep Contextualized Word Representations (NAACL 2018) [paper] - ELMo
  • Universal Language Model Fine-tuning for Text Classification (ACL 2018) [paper] - ULMFiT
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019) [paper][code][official PyTorch code] - BERT
  • Improving Language Understanding by Generative Pre-Training (CoRR 2018) [paper] - GPT
  • Language Models are Unsupervised Multitask Learners (CoRR 2019) [paper][code] - GPT-2
  • MASS: Masked Sequence to Sequence Pre-training for Language Generation (ICML 2019) [paper][code] - MASS
  • Unified Language Model Pre-training for Natural Language Understanding and Generation (CoRR 2019) [paper][code] - UNILM
  • Multi-Task Deep Neural Networks for Natural Language Understanding (ACL 2019) [paper][code] - MT-DNN
  • 75 Languages, 1 Model: Parsing Universal Dependencies Universally (EMNLP 2019) [paper][code] - UDify
  • ERNIE: Enhanced Language Representation with Informative Entities (ACL 2019) [paper][code] - ERNIE (THU)
  • ERNIE: Enhanced Representation through Knowledge Integration (CoRR 2019) [paper] - ERNIE (Baidu)
  • Defending Against Neural Fake News (CoRR 2019) [paper][code] - Grover
  • ERNIE 2.0: A Continual Pre-training Framework for Language Understanding (CoRR 2019) [paper] - ERNIE 2.0 (Baidu)
  • Pre-Training with Whole Word Masking for Chinese BERT (CoRR 2019) [paper] - Chinese-BERT-wwm
  • SpanBERT: Improving Pre-training by Representing and Predicting Spans (CoRR 2019) [paper] - SpanBERT
  • XLNet: Generalized Autoregressive Pretraining for Language Understanding (CoRR 2019) [paper][code] - XLNet
  • RoBERTa: A Robustly Optimized BERT Pretraining Approach (CoRR 2019) [paper] - RoBERTa
  • NEZHA: Neural Contextualized Representation for Chinese Language Understanding (CoRR 2019) [paper][code] - NEZHA
  • K-BERT: Enabling Language Representation with Knowledge Graph (AAAI 2020) [paper][code] - K-BERT
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (CoRR 2019) [paper][code] - Megatron-LM
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (CoRR 2019) [paper][code] - T5
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (CoRR 2019) [paper] - BART
  • ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations (CoRR 2019) [paper][code] - ZEN
  • The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service (CoRR 2019) [paper][code] - BAAI-JDAI-BERT
  • Knowledge Enhanced Contextual Word Representations (EMNLP 2019) [paper] - KnowBert
  • UER: An Open-Source Toolkit for Pre-training Models (EMNLP 2019) [paper][code] - UER
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (ICLR 2020) [paper] - ELECTRA
  • StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding (ICLR 2020) [paper] - StructBERT
  • FreeLB: Enhanced Adversarial Training for Language Understanding (ICLR 2020) [paper][code] - FreeLB
  • HUBERT Untangles BERT to Improve Transfer across NLP Tasks (CoRR 2019) [paper] - HUBERT
  • CodeBERT: A Pre-Trained Model for Programming and Natural Languages (CoRR 2020) [paper] - CodeBERT
  • ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training (CoRR 2020) [paper] - ProphetNet
  • ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation (CoRR 2020) [paper][code] - ERNIE-GEN
  • Efficient Training of BERT by Progressively Stacking (ICML 2019) [paper][code] - StackingBERT
  • UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training (CoRR 2020) [paper][code] - UNILMv2
  • Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space (CoRR 2020) [paper][code] - Optimus
  • MPNet: Masked and Permuted Pre-training for Language Understanding (CoRR 2020) [paper][code] - MPNet
  • Language Models are Few-Shot Learners (CoRR 2020) [paper][code] - GPT-3
  • SPECTER: Document-level Representation Learning using Citation-informed Transformers (ACL 2020) [paper] - SPECTER
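
For orientation, the sketch below reproduces BERT's masked-LM corruption rule (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% left unchanged), which several of the models above inherit or modify (whole-word masking, spans, permutations, pseudo-masks). The token IDs and vocabulary size are placeholders, not from any specific tokenizer.

```python
# Minimal sketch of BERT-style masked-LM corruption; IDs are illustrative.
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=frozenset(), p=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if tok in special_ids or random.random() >= p:
            continue
        labels[i] = tok                                # predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id                        # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)   # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([101, 2023, 2003, 1037, 3231, 102], mask_id=103,
                  vocab_size=30522, special_ids={101, 102}))
```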

Multi-Modal

  • VideoBERT: A Joint Model for Video and Language Representation Learning (ICCV 2019) [paper]
  • Learning Video Representations using Contrastive Bidirectional Transformer (CoRR 2019) [paper] - CBT
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (NeurIPS 2019) [paper][code]
  • VisualBERT: A Simple and Performant Baseline for Vision and Language (CoRR 2019) [paper][code]
  • Fusion of Detected Objects in Text for Visual Question Answering (EMNLP 2019) [paper][code](https://github.com/google-research/language/tree/master/language/question_answering/b2t2) - B2T2
  • Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training (AAAI 2020) [paper]
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers (EMNLP 2019) [paper][code]
  • VL-BERT: Pre-training of Generic Visual-Linguistic Representations (CoRR 2019) [paper][code]
  • UNITER: Learning UNiversal Image-TExt Representations (CoRR 2019) [paper]
  • FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval (SIGIR 2020) [paper] - FashionBERT
  • VD-BERT: A Unified Vision and Dialog Transformer with BERT (CoRR 2020) [paper] - VD-BERT

Multilingual

  • Cross-lingual Language Model Pretraining (CoRR 2019) [paper] - XLM
  • MultiFiT: Efficient Multi-lingual Language Model Fine-tuning (EMNLP 2019) [paper][code] - MultiFiT
  • XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization (CoRR 2020) [paper][code] - XTREME
  • WikiBERT Models: Deep Transfer Learning for Many Languages (CoRR 2020) [paper][code] - WikiBERT

Compression

  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (CoRR 2019) [paper]
  • Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System (CoRR 2019) [paper] - MKDM
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (CoRR 2019) [paper]
  • Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (CoRR 2019) [paper]
  • Small and Practical BERT Models for Sequence Labeling (EMNLP 2019) [paper]
  • Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT (CoRR 2019) [paper] - Q-BERT
  • Patient Knowledge Distillation for BERT Model Compression (EMNLP 2019) [paper] - BERT-PKD
  • Extreme Language Model Compression with Optimal Subwords and Shared Projections (ICLR 2019) [paper]
  • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (NeurIPS 2019 Workshop) [paper][code] - DistilBERT
  • TinyBERT: Distilling BERT for Natural Language Understanding (ICLR 2019) [paper][code] - TinyBERT
  • Q8BERT: Quantized 8Bit BERT (NeurIPS 2019 Workshop) [paper] - Q8BERT
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (ICLR 2020) [paper][code] - ALBERT
  • Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning (ICLR 2020) [paper][PyTorch code]
  • Reducing Transformer Depth on Demand with Structured Dropout (ICLR 2020) [paper] - LayerDrop
  • Multilingual Alignment of Contextual Word Representations (ICLR 2020) [paper]
  • AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search (CoRR 2020) [paper] - AdaBERT
  • MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers (CoRR 2020) [paper][code] - MiniLM
  • FastBERT: a Self-distilling BERT with Adaptive Inference Time (ACL 2020) [paper][code] - FastBERT
  • MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices (ACL 2020) [paper][code] - MobileBERT
  • DynaBERT: Dynamic BERT with Adaptive Width and Depth (CoRR 2020) [paper] - DynaBERT
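
Many of the compression papers above rely on knowledge distillation. The sketch below shows the standard temperature-scaled soft-target loss (as used in DistilBERT-style training) combined with the usual cross-entropy; the temperature and mixing weight are illustrative defaults, not values from any specific paper.

```python
# Minimal soft-target distillation loss sketch (PyTorch); hyperparameters illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # scale by T^2 to keep gradients comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(4, 3), torch.randn(4, 3)      # student / teacher logits
y = torch.tensor([0, 2, 1, 0])                   # gold labels
print(distillation_loss(s, t, y).item())
```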

Application

  • BERT for Joint Intent Classification and Slot Filling (CoRR 2019) [paper]
  • GPT-based Generation for Classical Chinese Poetry (CoRR 2019) [paper]
  • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019) [paper][code]
  • Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring (ICLR 2020) [paper]
  • Pre-training Tasks for Embedding-based Large-scale Retrieval (ICLR 2020) [paper]
  • K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters (CoRR 2020) [paper] - K-Adapter
  • Keyword-Attentive Deep Semantic Matching (CoRR 2020) [paper & code] [post] - Keyword BERT
  • Unified Multi-Criteria Chinese Word Segmentation with BERT (CoRR 2020) [paper]
  • Spelling Error Correction with Soft-Masked BERT (ACL 2020) [paper] - Soft-Masked BERT
  • DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering (ACL 2020) [paper][code] - DeFormer
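
As an example of the application pattern used by Sentence-BERT and the retrieval papers above, the sketch below performs mask-aware mean pooling over token embeddings and scores a sentence pair with cosine similarity; the random tensors stand in for a real encoder's output.

```python
# Minimal mean-pooling + cosine-similarity sketch; embeddings are placeholders.
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings, attention_mask):
    # Average only over non-padding positions.
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                        # (batch, hidden)

emb = torch.randn(2, 6, 768)                      # stand-in for encoder outputs
mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]])
sent = mean_pool(emb, mask)
print(F.cosine_similarity(sent[0], sent[1], dim=0).item())  # sentence similarity score
```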

Analysis & Tools

  • Probing Neural Network Comprehension of Natural Language Arguments (ACL 2019) [paper][code]
  • Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference (ACL 2019) [paper] [code]
  • To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks (RepL4NLP@ACL 2019) [paper]
  • Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection (CICLing 2019) [paper]
  • Understanding the Behaviors of BERT in Ranking (CoRR 2019) [paper]
  • How to Fine-Tune BERT for Text Classification? (CoRR 2019) [paper]
  • What Does BERT Look At? An Analysis of BERT's Attention (BlackBoxNLP 2019) [paper][code]
  • Visualizing and Understanding the Effectiveness of BERT (EMNLP 2019) [paper]
  • exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models (CoRR 2019) [paper] [code]
  • Transformers: State-of-the-art Natural Language Processing [paper][code][code]
  • Do Attention Heads in BERT Track Syntactic Dependencies? [paper]
  • Fine-tune BERT with Sparse Self-Attention Mechanism (EMNLP 2019) [paper]
  • How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings (EMNLP 2019) [paper]
  • oLMpics -- On what Language Model Pre-training Captures (CoRR 2019) [paper]
  • Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment (AAAI 2020) [paper][code] - TextFooler
  • A Mutual Information Maximization Perspective of Language Representation Learning (ICLR 2020) [paper]
  • Cross-Lingual Ability of Multilingual BERT: An Empirical Study (ICLR 2020) [paper]
  • Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping (CoRR 2020) [paper]
  • How Much Knowledge Can You Pack Into the Parameters of a Language Model? (CoRR 2020) [paper]
  • BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations (CoRR 2020) [paper]
  • Contextual Embeddings: When Are They Worth It? (ACL 2020) [paper]
  • Adversarial Training for Large Neural Language Models (CoRR 2020) [paper][code]
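
Several of the analysis papers above (e.g. "What Does BERT Look At?") inspect per-head attention maps. Below is a minimal sketch of extracting them with the Hugging Face transformers library; it assumes the bert-base-uncased checkpoint is available for download.

```python
# Minimal sketch of inspecting per-head attention maps; assumes transformers
# and the bert-base-uncased checkpoint are available.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
print(len(outputs.attentions), outputs.attentions[0].shape)
```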

Tutorial & Survey

  • Transfer Learning in Natural Language Processing (NAACL 2019) [paper]
  • Evolution of Transfer Learning in Natural Language Processing (CoRR 2019) [paper]
  • Transferring NLP Models Across Languages and Domains (DeepLo 2019) [paper]
  • Pre-trained Models for Natural Language Processing: A Survey (Invited Review of Science China Technological Sciences 2020) [paper]
  • Embeddings in Natural Language Processing (2020) [book]

Repository

Chinese Blog

English Blog
