ICLR 2020
Lalibela, Ethiopia by Trevor Cole on Unsplash.
Selections from ICLR 2020 (mostly NLP-related)
See also: https://www.theexclusive.org/2020/05/virtual-iclr.html
- Workshops
- Socials
- Keynotes
- Papers
- Adversarial ML and Robustness
- Neural Network Architectures
- Compositionality
- Emergent Language
- Explainability
- Graphs
- Graph Neural Networks
- Knowledge Graphs
- Learning with Less Labels
- Language Models and Transformers
- Reasoning
- Reinforcement Learning
- Style Transfer and Generative Models
- Text Generation
- Miscellaneous
Workshops
- Causal Learning For Decision Making (videos)
- Towards Trustworthy ML: Rethinking Security and Privacy for ML (videos) "Bring together experts from a variety of communities (ML, computer security, data privacy, fairness, & ethics) to work on promising ideas and research directions."
- Bridging AI and Cognitive Science (BAICS) (videos)
- Fundamental Science in the era of AI
- Neural Architecture Search (videos)
- Integration of Deep Neural Models and Differential Equations (videos)
- Beyond 'tabula rasa' in reinforcement learning: agents that remember, adapt, and generalize (videos)
- AI for Overcoming Global Disparities in Cancer Care (videos)
- AI for Affordable Healthcare (videos)
- AI for Earth Sciences (videos)
- Computer Vision for Agriculture (CV4A) (videos)
- Tackling Climate Change with ML (videos)
- ML-IRL: Machine Learning in Real Life
- AfricaNLP - Unlocking Local Languages (videos)
- Practical ML for Developing Countries: learning under limited/low resource scenarios (videos)
Some workshop talks
- Gopnik, Causal Learning in Children: When children are better learners than adults - or AI
- Karimi, Algorithmic Recourse: from Counterfactual Explanations to Interventions (paper)
- Neubig, The Low-resource Natural Language Processing Toolbox, 2020 Version
Socials
Lots of "social" events, both topic-based and demographic-based:
- Topics in Language Research
- Learning Representation for Cybersecurity
- Research with 🤗 Transformers
- ICLR Town
Keynotes
- Leslie Kaelbling: Doing for Our Robots What Nature Did For Us
- Ruha Benjamin: 2020 Vision: Reimagining the Default Settings of Technology & Society
- A discussion on how even apparently neutral technology can perpetuate discrimination. Technologists and researchers should be aware of the societal consequences of their work.
- Mihaela van der Schaar: Machine Learning: Changing the future of healthcare
- Yann LeCun and Yoshua Bengio: Reflections from the Turing Award Winners
- Yann LeCun: "The future is self-supervised". Challenges for deep learning: (1) learning with less labeled data (self-supervised learning!), (2) making reasoning compatible with gradient-based learning, i.e., going beyond 'system 1', and (3) learning complex (hierarchical) action sequences (nothing to say here). Mostly a discussion of energy-based models (not too different from previous talks). "Could energy-based SSL be a basis for common sense?"
- Yoshua Bengio: "Deep learning priors associated with conscious processing". Similar to this other recent talk.
- ML and Consciousness ("Consciousness Prior")
- The need for systematic generalization by dynamically recombining existing concepts, while avoiding the pitfalls of classical AI (e.g., the need for uncertainty handling, distributed representations, efficient search, grounding in 'system 1', and large-scale training).
- Michael I. Jordan: The Decision-Making Side of Machine Learning: Dynamical, Statistical and Economic Perspectives. Note: see also "Artificial Intelligence - The Revolution Hasn't Happened Yet".
Papers
Observations: there was less of a distinction between posters and orals than in an IRL conference, as posters were just short talks. I thought the "poster" format worked very well, but I was much less likely to interact with the authors than in a non-virtual conference.
Some popular topics: reinforcement learning, adversarial ML, graph neural networks.
See also:
Adversarial ML and Robustness
TL;DR: "We propose the first algorithm for verifying the robustness of Transformers."
Problem: models often "latch onto" spurious correlations: features that work on most training examples but don't solve the problem as we would expect. E.g., in image classification, waterbirds and water backgrounds often (but not always) co-occur. Overall accuracy may be high, but worst-group accuracy (e.g., on waterbirds on land) can be very low.
Goal: train models that are more robust to spurious correlations, i.e., that have lower worst-group error.
Solution: Group distributionally robust optimization (DRO): minimize the worst-group's average loss, rather than the (overall) average loss. This requires knowing groups (attributes and labels) for each training example (but not at test time). A stochastic optimization algorithm is proposed and convergence guarantees are derived.
But: the worst-group error of Group DRO (at test time) is still high, i.e., poor generalization! Previous work on small convex or generative models suggests this shouldn't happen; it happens here because the models are overparametrized SOTA neural networks. The fix: use stronger regularization than usual (an L2 penalty).
Evaluation: Two image classification datasets (CelebA and Waterbirds) and one NLI dataset (MultiNLI).
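The group-reweighting idea behind Group DRO can be sketched in a few lines: maintain a distribution over groups, up-weight groups in proportion to their current loss (an exponentiated-gradient step), and weight the training loss by that distribution. A minimal numpy sketch, with function names and the step size of my own choosing; this is not the authors' code:

```python
import numpy as np

def group_dro_weights(group_losses, weights, eta=0.1):
    """One exponentiated-gradient update on the group weights:
    groups with higher loss get up-weighted, so minimizing the weighted
    loss approximates minimizing the worst-group loss."""
    w = weights * np.exp(eta * group_losses)
    return w / w.sum()

def group_dro_loss(per_example_loss, group_ids, weights):
    """Weighted sum of per-group average losses."""
    n_groups = len(weights)
    group_losses = np.array([
        per_example_loss[group_ids == g].mean() if np.any(group_ids == g) else 0.0
        for g in range(n_groups)
    ])
    return float(np.dot(weights, group_losses)), group_losses
```

Alternating this weight update with ordinary gradient steps on the weighted loss gives the stochastic algorithm the paper analyzes; note it needs group labels at training time only.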
The remaining papers deal with adversarial ML in computer vision:
Problem: adversarial examples (for images) are typically created using perturbations within a small l_p ball; such attacks are easy to defend against using JPEG compression or randomized smoothing.
Contributions: introduce "semantically motivated" adversarial perturbations (manipulating color and texture) with no l_p bounds (unlike most perturbations in the literature, these are large, structured, and explainable). These are shown to fool some common defenses (JPEG-75, Feature Squeezing, and adversarially trained models).
- Colorization attack: use a pre-trained colorization model and "color hints" to colorize an image in order to fool the classifier. But need to do it carefully in order to keep the colors similar to the original colors.
- Texture attack: style transfer (transfer texture from another image). This works best with an image from the target adversarial class, but with similar features to the original image.
Evaluation:
- Misclassification rate under various defenses. Also, attacks transfer.
- User study: humans have difficulty in detecting the attack.
- Caption attack: these adversarial images also fool image captioning systems! E.g., "A man is holding an apple" -> "A dog is holding an apple".
TL;DR: "We propose a novel combination of adversarial training and provable defenses which produces a model with state-of-the-art accuracy and certified robustness on CIFAR-10."
TL;DR: "FGSM-based adversarial training, with randomization, works just as well as PGD-based adversarial training: we can use this to train a robust classifier in 6 minutes on CIFAR10, and 12 hours on ImageNet, on a single machine." (cheaper than PGD).
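The recipe is simple enough to sketch: start from a random point inside the eps-ball, take a single signed-gradient step, and clip back to the ball. Below is a minimal numpy illustration on logistic regression, where the input gradient has a closed form; this is a toy stand-in for the paper's deep-network setting, not its implementation:

```python
import numpy as np

def fgsm_with_random_start(x, y, w, eps, rng):
    """'Fast' adversarial example: random init in the eps-ball, then ONE
    FGSM step (the random start is the key trick that makes single-step
    training competitive with multi-step PGD).
    Model: logistic regression with loss log(1 + exp(-y * w.x))."""
    delta = rng.uniform(-eps, eps, size=x.shape)   # random start
    margin = y * np.dot(w, x + delta)
    grad = -y * w / (1.0 + np.exp(margin))         # d loss / d input (closed form)
    delta = delta + eps * np.sign(grad)            # one signed-gradient step
    delta = np.clip(delta, -eps, eps)              # project back onto the eps-ball
    return x + delta
```

Training then proceeds as usual, but on `fgsm_with_random_start(...)` outputs instead of clean inputs.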
Neural Network Architectures
NOTE: Transformers and Graph Neural Networks get their own categories.
Presents a new architecture which simulates a Universal Turing Machine.
Initial motivation -- input embeddings for language models are based on the average context; it might be better (particularly for verbs and function words) to use the actual context. But forget this! "Mogrify" the LSTM by adding more than one round of gating. This achieves lower perplexity than LSTMs and Transformer XL (on Penn Treebank and Wikitext-2).
Why does the Mogrifier work? There are many plausible reasons, none of them fully convincing. On a synthetic dataset, the Mogrifier LSTM also outperforms the LSTM (with larger gains for larger vocabulary size). "Sadly, we could not escape the deep learning pit and a convincing explanation remained elusive".
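The core modification is easy to state: before the LSTM cell runs, the input and the previous hidden state repeatedly gate each other. A simplified numpy sketch of that gating (the paper uses separate, possibly low-rank matrices per round; here one Q and one R are tied across rounds for brevity):

```python
import numpy as np

def mogrify(x, h, Q, R, rounds=5):
    """Mogrifier gating (sketch): alternately
      x <- 2 * sigmoid(Q h) * x   (odd rounds)
      h <- 2 * sigmoid(R x) * h   (even rounds)
    then feed the 'mogrified' x and h to a standard LSTM cell."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for i in range(1, rounds + 1):
        if i % 2:
            x = 2.0 * sigmoid(Q @ h) * x
        else:
            h = 2.0 * sigmoid(R @ x) * h
    return x, h
```

With `rounds=0` this reduces to a plain LSTM; the paper reports around 5 rounds working best.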
TL;DR: "A self-attention layer can perform convolution and often learns to do so in practice."
Transformers are great at NLP tasks. They can also reach SOTA accuracy on vision tasks (Bello et al. 2019; Ramachandran et al., 2019). Why does self-attention work so well for images? This paper shows that multi-head self-attention can express convolutions.
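The intuition behind the construction can be shown in miniature: if each head's attention is a one-hot distribution over a fixed relative offset, the head simply shifts the sequence, and summing heads weighted by kernel taps is exactly a convolution. A toy 1-D numpy illustration (the paper's actual construction is for 2-D images with relative positional encodings):

```python
import numpy as np

def conv_via_attention(x, kernel):
    """Each 'head' attends with weight 1 to position i + offset (a one-hot
    attention matrix), so its output is a shifted copy of the sequence;
    summing heads scaled by the kernel taps reproduces a 1-D convolution
    (cross-correlation, zero-padded at the boundaries)."""
    n, k = len(x), len(kernel)
    offsets = range(-(k // 2), k - k // 2)
    out = np.zeros(n)
    for tap, off in zip(kernel, offsets):
        A = np.zeros((n, n))            # one-hot attention: i attends to i + off
        for i in range(n):
            if 0 <= i + off < n:
                A[i, i + off] = 1.0
        out += tap * (A @ x)            # head output, scaled by the kernel weight
    return out
```

The paper shows that relative positional attention can *learn* such one-hot patterns, which is why attention heads in vision models often end up looking like convolutional filters.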
Compositionality
TL;DR: "We propose a link between permutation equivariance and compositional generalization, and provide equivariant language models."
Compositionality example: if one understands "Today I will run twice" and "I walk to school every day", one should also understand "I will have to walk twice around the store". The SCAN benchmark: machine translation between simple natural language commands (e.g., "jump", "walk left", "turn right twice") and 'machine actions' (e.g., JUMP, LTURN WALK, RTURN RTURN RTURN).
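To make the benchmark concrete, here is a tiny interpreter for a fragment of SCAN's command language (my simplification; the real grammar also includes 'around', 'opposite', 'and', and 'after'):

```python
PRIMS = {"jump": "JUMP", "walk": "WALK", "run": "RUN", "look": "LOOK"}
TURNS = {"left": "LTURN", "right": "RTURN"}

def interpret(command):
    """Map a SCAN-style command to a machine-action sequence.
    Handles: primitive verbs, 'turn <dir>', '<verb> <dir>',
    and the repetition modifiers 'twice' / 'thrice'."""
    words = command.split()
    reps = 1
    if words[-1] in ("twice", "thrice"):
        reps = 2 if words[-1] == "twice" else 3
        words = words[:-1]
    if words[0] == "turn":                      # e.g. "turn right" -> RTURN
        actions = [TURNS[words[1]]]
    elif len(words) == 2 and words[1] in TURNS:  # e.g. "walk left" -> LTURN WALK
        actions = [TURNS[words[1]], PRIMS[words[0]]]
    else:                                        # e.g. "jump" -> JUMP
        actions = [PRIMS[words[0]]]
    return " ".join(actions * reps)
```

The benchmark's point is that the mapping is trivially systematic, yet seq2seq models fail on splits that require recombining known atoms (e.g., training on "jump" and "walk twice" but testing on "jump twice").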
TL;DR: "Benchmark and method to measure compositional generalization by maximizing divergence of compound frequency at small divergence of atom frequency."
Compositional Generalization: ability to generalize to unseen combinations of known components (atoms).
Goal: want to measure how much compositional generalization is required for a given train/test split.
"Compound divergence": a more comprehensive measure than previous approaches, assuming that (1) all test atoms occur in training, (2) the distribution of atoms is similar in train and test and (3) distribution of compounds is different between train and test. Compound divergence correlates well with previous ad-hoc methods.
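If I read the paper correctly, both atom and compound divergence are instances of one formula based on the Chernoff coefficient, with a different alpha for each (0.5 for atoms, 0.1 for compounds, so a compound counts as largely "covered" if it occurs in training at all). A short sketch under that assumption:

```python
from collections import Counter

def chernoff_divergence(p_counts, q_counts, alpha):
    """D_alpha(P || Q) = 1 - sum_x p(x)^alpha * q(x)^(1 - alpha),
    where P and Q are the (normalized) frequency distributions of atoms
    or compounds in the train and test sets."""
    p_tot = sum(p_counts.values())
    q_tot = sum(q_counts.values())
    sim = sum((c / p_tot) ** alpha * (q_counts[x] / q_tot) ** (1 - alpha)
              for x, c in p_counts.items() if x in q_counts)
    return 1.0 - sim
```

A maximally hard split then maximizes `chernoff_divergence(train_compounds, test_compounds, 0.1)` while keeping `chernoff_divergence(train_atoms, test_atoms, 0.5)` near zero.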
Evaluation: Compositional Freebase Questions (CFQ) and SCAN. An LSTM with attention, a Transformer, and a Universal Transformer are compared. Compound divergence is a great predictor of accuracy! Current systems fail to generalize compositionally, even with large training data, while the random split is easy. (But it appears Transformers outperform LSTM+attention by a wide margin for almost every value of compound divergence -- see also results on syntactic generalization in https://arxiv.org/pdf/2005.03692.pdf )
TL;DR: "We isolate the environmental and training factors that contribute to emergent systematic generalization in a situated language-learning agent."
From the discussion: see An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution.
Goal: continually learn new words in seq2seq tasks (e.g., instruction learning using the SCAN dataset and machine translation).
Emergent Language
TL;DR: "Use iterated learning framework to facilitate the dominance of high compositional language in multi-agent games."
Explainability
TL;DR: "Humans in the loop revise documents to accord with counterfactual labels, resulting resource helps to reduce reliance on spurious associations."
See also: Evaluating NLP Models via Contrast Sets
TL;DR: "A method to explain a classifier, by generating visual perturbation of an image by exaggerating or diminishing the semantic features that the classifier associates with a target label."
Creating image counterfactuals with GANs.
TL;DR: "A novel deep interpretable architecture that achieves state of the art on three large scale univariate time series forecasting datasets."
TL;DR: "We propose measurement of phrase importance and algorithms for hierarchical explanation of neural sequence model predictions."
Graphs
Graph Neural Networks
There were a lot of papers about GNNs. Here are a few I found interesting:
Knowledge Graphs
Learning with Less Labels
Language Models and Transformers
NOTE: here is another summary of some of the papers on Transformers.
From the discussion: see also "Forgetting Exceptions is Harmful in Language Learning" (1998) https://arxiv.org/abs/cs/9812021, on the same theme of generalization vs memorization.
Reasoning
Reinforcement Learning
Style Transfer and Generative Models
TL;DR: "We formulate a probabilistic latent sequence model to tackle unsupervised text style transfer, and show its effectiveness across a suite of unsupervised text style transfer tasks."
TL;DR: "Stochastic style transfer with adjustable features."
TL;DR: "A model to control the generation of images with GAN and beta-VAE with regard to scale and position of the objects."
Text Generation
Evaluation: text infilling task (on the SWAG and DailyDialog datasets); the method outperforms unidirectional decoding baselines.
The "nucleus sampling" (top-p sampling) paper.
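For reference, the sampling rule itself is only a few lines: keep the smallest prefix of the probability-sorted vocabulary whose cumulative mass reaches p, renormalize, and sample from it. A numpy sketch:

```python
import numpy as np

def nucleus_sample(probs, p, rng):
    """Top-p ('nucleus') sampling: truncate the distribution to the smallest
    set of tokens with cumulative probability >= p, renormalize, sample."""
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

Unlike top-k, the number of candidate tokens adapts to the shape of the distribution: it is small when the model is confident and large when it is not.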
Miscellaneous
TL;DR: "We represent a computer program using a set of simpler programs and use this representation to improve program synthesis techniques."
TL;DR: "We show that there is a hidden generative model inside of every classifier. We demonstrate how to train this model and show the many benefits of doing so."
TL;DR: "This paper proposes a meta-learning objective based on speed of adaptation to transfer distributions to discover a modular decomposition and causal variables."
Have we "almost solved video understanding"? 3D convolutional models (which take time into account) perform only slightly better than their 2D counterparts, yet the temporal ordering of frames should be essential: real-world video understanding requires reasoning about object permanence, estimating intentions, and causal reasoning.
This paper presents a new dataset, CATER (Compositional Actions and Temporal Reasoning), and a series of benchmark tasks on the dataset which require temporal reasoning to solve. E.g., recognize "rotate(cube) after slide(cone)" from the video clip. SOTA models struggle with the temporal reasoning tasks.