greydanus / mnist1d

A 1D analogue of the MNIST dataset for measuring spatial biases and answering Science of Deep Learning questions.

License: Apache License 2.0

neural-networks machine-learning convnet pytorch dataset research

mnist1d's Introduction

The MNIST-1D dataset

Dec 5, 2023: MNIST-1D featured in Simon Prince's Understanding Deep Learning textbook

Blog post | Paper

Run in your browser

Overview

Machine learning models all get about the same test accuracy on MNIST. This dataset is smaller than MNIST and does a better job of separating good models from bad ones.

overview.png

Dataset               Logistic regression   MLP    CNN    GRU*   Human expert
MNIST                 94%                   99+%   99+%   99+%   99+%
MNIST-1D              32%                   68%    94%    91%    96%
MNIST-1D (shuffle**)  32%                   68%    56%    57%    ~30%

*Training the GRU takes at least 10x the walltime of the CNN.

**The term "shuffle" refers to shuffling the spatial dimension of the dataset, as in Zhang et al. (2017).

Motivation

The original MNIST dataset is supposed to be the Drosophila of machine learning, but it has a few drawbacks:

  • Discrimination between models. The difference between major ML models comes down to a few percentage points.
  • Dimensionality. Examples are 784-dimensional vectors so training ML models can take non-trivial compute and memory (think neural architecture search and metalearning).
  • Hard to hack. MNIST is not procedurally generated so it's hard to change the noise distribution, the scale/rotation/translation/shear/etc of the digits, or the resolution.

We developed MNIST-1D to address these issues. It is:

  • Discriminative between models. There is a broad spread in test accuracy between key ML models.
  • Low dimensional. Each MNIST-1D example is a 40-dimensional vector. This means faster training and less memory.
  • Easy to hack. There's an API for adjusting max_translation, corr_noise_scale, shear_scale, final_seq_length and more (see the sketch after this list). The code is clean and modular.
  • Still has some real-world relevance. Though it's low-dimensional and synthetic, this task is arguably more interesting than Sklearn's datasets such as two_moons, two_circles, or gaussian_blobs.
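
Here's a minimal sketch of that hackability, using the get_dataset_args and get_dataset helpers from mnist1d.data; the parameter names match the list above, but treat the exact values, the path, and the download flag as illustrative:

from mnist1d.data import get_dataset_args, get_dataset

args = get_dataset_args()        # default generation settings
args.max_translation = 60        # make translation invariance matter more
args.corr_noise_scale = 0.5      # add more correlated noise
args.final_seq_length = 40       # length of each 1D example

# With download=False, the dataset is rebuilt from `args` if no file is found
data = get_dataset(args, path='./mnist1d_data_custom.pkl', download=False)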

Getting the dataset

Here's a minimal example of how to download the dataset:

import pickle
import requests

# Download the pre-built dataset and save it locally
url = 'https://github.com/greydanus/mnist1d/raw/master/mnist1d_data.pkl'
r = requests.get(url, allow_redirects=True)
with open('./mnist1d_data.pkl', 'wb') as f:
    f.write(r.content)

# Load it back; the pickle contains a dict of NumPy arrays
with open('./mnist1d_data.pkl', 'rb') as handle:
    data = pickle.load(handle)

data.keys()

>>> dict_keys(['x', 'x_test', 'y', 'y_test', 't', 'templates'])  # these are NumPy arrays

A slightly better way to do things is to clone this repo and then use the get_dataset function in data.py to do essentially the same thing, along the lines of:
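
from mnist1d.data import get_dataset_args, get_dataset

# Mirrors the call shown in the issue further below; path and flags are illustrative
args = get_dataset_args()
data = get_dataset(args, path='./mnist1d_data.pkl', download=True)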

Dimensionality reduction

Visualizing the MNIST and MNIST-1D datasets with t-SNE. The well-defined clusters in the MNIST plot indicate that the majority of the examples are separable via a kNN classifier in pixel space. The MNIST-1D plot, meanwhile, reveals a lack of well-defined clusters, which suggests that learning a nonlinear representation of the data is much more important for successful classification.

tsne.png

Thanks to Dmitry Kobak for this contribution.
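
A minimal sketch of how to produce this kind of plot, assuming scikit-learn and matplotlib are installed (neither is a listed dependency) and that `data` was loaded as above:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed the 40-dimensional examples into 2D and color by class label
emb = TSNE(n_components=2, random_state=0).fit_transform(data['x'])
plt.scatter(emb[:, 0], emb[:, 1], c=data['y'], s=4, cmap='tab10')
plt.title('t-SNE of MNIST-1D')
plt.show()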

Constructing the dataset

This is a synthetically-generated dataset which, by default, consists of 4000 training examples and 1000 testing examples (you can change this as you wish). Each example contains a template pattern that resembles a handwritten digit between 0 and 9. These patterns are analogous to the digits in the original MNIST dataset.

Original MNIST digits

mnist1d_black.png

1D template patterns

mnist1d_black.png

1D templates as lines

mnist1d_white.png

In order to build the synthetic dataset, we pass the templates through a series of random transformations: adding random amounts of padding, translation, correlated noise, iid noise, and scaling. We use these transformations because they are relevant for both 1D signals and 2D images, so even though our dataset is 1D, we can expect some of our findings to hold for 2D (image) data. For example, we can study the advantage of using a translation-invariant model (e.g., a CNN) by making a dataset where signals occur at different locations in the sequence; we do this by using large padding and translation coefficients. Here's an animation of how those transformations are applied.

mnist1d_tranforms.gif
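
Here is an illustrative NumPy sketch (not the repo's actual generation code) of the kinds of transformations described above; all coefficients are made up for the example:

import numpy as np

def transform(template, pad=36, max_shift=48, noise_scale=0.25, rng=np.random):
    x = np.concatenate([np.zeros(pad), template, np.zeros(pad)])   # pad both sides
    x = np.roll(x, rng.randint(0, max_shift))                      # random translation
    corr = np.convolve(rng.randn(x.size + 9), np.ones(10) / 10, mode='valid')
    x = x + 0.5 * corr                                             # correlated noise
    x = x + noise_scale * rng.randn(x.size)                        # iid noise
    return x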

Unlike the original MNIST dataset, which consists of 2D arrays of pixels (each image has 28x28 = 784 dimensions), this dataset consists of 1D time series of length 40. This means each example is ~20x smaller, making the dataset much quicker and easier to iterate over. Another nice thing about this toy dataset is that it does a good job of separating different types of deep learning models, many of which get the same 98-99% test accuracy on MNIST.

Example Use Cases

Quantifying CNN spatial priors

For a fixed number of training examples, we show that a CNN achieves far better test generalization than a comparable MLP. This highlights the value of the inductive biases that we build into ML models. For code, see the second half of the quickstart.

benchmarks.png
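
For intuition, here's a sketch of the two model families being compared, assuming PyTorch; the layer sizes are illustrative rather than the exact benchmark architectures:

import torch.nn as nn

# A multilayer perceptron: no spatial prior, treats the 40 inputs as unordered
mlp = nn.Sequential(
    nn.Linear(40, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10))

# A 1D CNN: local connectivity and weight sharing give it a spatial prior
cnn = nn.Sequential(                       # input shape: (batch, 1, 40)
    nn.Conv1d(1, 25, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv1d(25, 25, kernel_size=3, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(10))                     # infers the flattened feature size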

Finding lottery tickets

We obtain sparse "lottery ticket" masks as described by Frankle & Carbin (2018). Then we perform ablation studies and analysis on them to determine exactly what makes these masks special (spoiler: they have spatial priors, including local connectivity). One result, which contradicts the original paper, is that lottery ticket masks can be beneficial even under different initial weights. We suspect this effect is present but vanishingly small in the experiments performed by Frankle & Carbin.

lottery.png

lottery_summary.png
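
A sketch of the core mask-extraction step, assuming PyTorch and a trained model such as the mlp sketched above; the 10% keep-ratio is illustrative:

import torch

masks = {}
for name, p in mlp.named_parameters():
    if p.dim() > 1:                                    # mask weight matrices only
        k = int(0.9 * p.numel())                       # rank of the pruning threshold
        thresh = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > thresh).float()       # 1 = keep, 0 = prune

# During retraining from new initial weights, apply p.data.mul_(masks[name])
# after each optimizer step to keep pruned weights at zero.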

Observing deep double descent

We replicate the "deep double descent" phenomenon described by Belkin et al. (2018) and more recently studied at scale by Nakkiran et al. (2019).

deep_double_descent.png
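
A minimal sketch of such a capacity sweep, assuming PyTorch and the `data` dict loaded earlier; the training details are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor(data['x'], dtype=torch.float32)
y = torch.tensor(data['y'], dtype=torch.long)
x_test = torch.tensor(data['x_test'], dtype=torch.float32)
y_test = torch.tensor(data['y_test'], dtype=torch.long)

for width in [4, 16, 64, 256, 1024]:                   # sweep model capacity
    model = nn.Sequential(nn.Linear(40, width), nn.ReLU(), nn.Linear(width, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(2000):                           # train to (near) interpolation
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        acc = (model(x_test).argmax(1) == y_test).float().mean()
    print(width, acc.item())                           # test accuracy vs. capacity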

Metalearning a learning rate

A simple notebook introduces gradient-based metalearning, also known as "unrolled optimization." In the spirit of Maclaurin et al. (2015), we use this technique to obtain the optimal learning rate for an MLP.

metalearn_lr.png
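
The core trick, sketched below with PyTorch: differentiate the loss after an unrolled inner training run with respect to a learnable log learning rate. A linear model stands in for the MLP, and `x` and `y` are the tensors from the previous sketch:

import torch
import torch.nn.functional as F

log_lr = torch.tensor(-2.0, requires_grad=True)        # meta-parameter: log learning rate
meta_opt = torch.optim.Adam([log_lr], lr=1e-2)

for meta_step in range(50):
    params = torch.zeros(40, 10, requires_grad=True)   # fresh inner model each time
    for step in range(10):                             # unrolled inner training loop
        loss = F.cross_entropy(x @ params, y)
        grad, = torch.autograd.grad(loss, params, create_graph=True)
        params = params - log_lr.exp() * grad          # differentiable SGD step
    meta_loss = F.cross_entropy(x @ params, y)         # quality of the unrolled run
    meta_opt.zero_grad()
    meta_loss.backward()                               # backprop through the unroll
    meta_opt.step()

print('learned lr:', log_lr.exp().item())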

Metalearning an activation function

This project uses the same principles as the learning rate example, but tackles a new problem that (to our knowledge) has not been tackled via gradient-based metalearning: how to obtain the perfect nonlinearity for a neural network. We start from an ELU activation function and parameterize the offset with an MLP. We use unrolled optimization to find the offset that leads to the lowest training loss, across the last 200 steps, for an MLP classifier trained on MNIST-1D. Interestingly, the result somewhat resembles the Swish activation described by Ramachandran et al. (2017); the main difference is a positive regime between -4 and -1.

metalearn_afunc.png

Benchmarking pooling methods

We investigate the relationship between the number of training samples and the usefulness of pooling methods. We find that pooling is typically very useful in the low-data regime, but this advantage diminishes as the amount of training data increases.

pooling.png

Dependencies

  • NumPy
  • SciPy
  • PyTorch

mnist1d's People

Contributors

dkobak · greydanus · jakobj · karim-53


mnist1d's Issues

Publication

Congrats on this dataset being prominently used in Prince's ML textbook: https://udlbook.github.io/udlbook/ !

It made me wonder: was your paper published anywhere? It seems like it wasn't, but have you tried to publish it? I think it's really cool, so I am wondering whether it kept getting rejected or whether you haven't really tried to publish it.

Cannot load default dataset

The default dataset seems to be broken. Not a big deal since it just automatically rebuilds, but I thought I should bring it to your attention.

from mnist1d.data import get_dataset_args, get_dataset
args = get_dataset_args()
data = get_dataset(args,  path='./mnist1d_data.pkl', download=True)
-> 
Downloading MNIST1D dataset from https://github.com/greydanus/mnist1d/raw/master/mnist1d_data.pkl   
Saving to ./mnist1d_data.pkl
Did or could not load data from ./mnist1d_data.pkl. Rebuilding dataset...

minimal repro: https://colab.research.google.com/drive/16RBVCMLXegLZ6xidI3ZBe_visT_gR5EA?usp=sharing

Packaging this dataset

Hi, thanks for contributing this dataset as open source. Motivated by "Understanding Deep Learning", I'd love to start playing with it.

I saw that the repo is not set up for packaging. As the code is rather straightforward, I was wondering whether PRs to package this as a Python module would be appreciated.

This would also make it easier to use more broadly than via the requests package. People could just run pip install mnist1d and have the code for augmenting and changing the data at their fingertips. Just a thought!

Right now, I get the following:

python -m pip install git+https://github.com/greydanus/mnist1d.git@master
Collecting git+https://github.com/greydanus/mnist1d.git@master
  Cloning https://github.com/greydanus/mnist1d.git (to revision master) to /tmp/pip-req-build-eg1a_6cu
  Running command git clone --filter=blob:none --quiet https://github.com/greydanus/mnist1d.git /tmp/pip-req-build-eg1a_6cu
  Resolved https://github.com/greydanus/mnist1d.git to commit 39dd6c03785eefe60f349af94e61f864fc449644
ERROR: git+https://github.com/greydanus/mnist1d.git@master does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
