Course notes and notebooks to teach the fundamentals of how deep learning works; uses PyTorch.
At the start of the seminar, we will go through a crash course in machine learning and the basics of deep learning. See Crash course slides (PDF). Then, we'll jump immediately into lecture/lab using the notebooks in the next section.
- intro-regression-training-cars.ipynb
Load toy cars data set and train regression models to predict miles per gallon (MPG) through a variety of techniques. We start out doing a brute force grid search of many different slope and intercept (m, b) model parameters, looking for the best fit. Then we manually compute partial derivatives of the loss function and perform gradient descent using plain numpy. We look at the effect on the loss function of normalizing numeric variables to have zero mean and standard deviation one. Finally, this notebook shows you how to use the autograd (auto differentiation) functionality of pytorch as a way to transition from numpy to pytorch training loops.
- pytorch-nn-training-cars.ipynb
Once we can implement our own gradient descent using pytorch autograd and matrix algebra, it's time to graduate to using pytorch's built-in neural network module and the built-in optimizers (e.g., Adam). Next, we observe how a sequence of two linear models is effectively the same as a single linear model. After we add a nonlinearity, we see more sophisticated curve fitting. Then we see how a sequence of multiple linear units plus nonlinearities affects predictions. Finally, we see what happens if we give a model too much power: the regression curve overfits the training data.
- train-test-diabetes.ipynb
This notebook explores how to use a validation set to estimate how well a model generalizes from its training data to unseen test vectors. We will see that deep learning models often have so many parameters that we can drive training loss to zero, but unfortunately the validation loss grows as the model overfits. We will also compare the deep learning model against a random forest model as a baseline.
- batch-normalization.ipynb (optional)
- binary-classifier-wine.ipynb
Shifting to binary classification now, we consider the toy wine data set and build models that use features proline and alcohol to predict wine classification (class 0 or class 1). We will add a sigmoid activation function to the final linear layer, which will give us the probability that an input vector represents class 1. A single linear layer plus the sigmoid yields a standard logistic regression model. By adding another linear layer and nonlinearity, we see a curved decision boundary between classes. By adding lots of neurons and more layers, we see even more complex decision boundaries appear.
- multiclass-classifier-mnist.ipynb
To demonstrate k-class classification instead of binary classification, we use the traditional MNIST digit image recognition problem. We'll again use a random forest model as a baseline classifier. Instead of a sigmoid on a single output neuron, k-class classifiers use k neurons in the final layer and then a softmax computation instead of a simple sigmoid. We see fairly decent recognition results with just 50 neurons. By using 1000 neurons, we get slightly better results. To demonstrate cyclic learning rates, which sometimes help to find more general solutions, there's a final model using pytorch's CyclicLR learning rate scheduler.
- gpu-mnist.ipynb
This notebook redoes the examples from the previous MNIST notebook but uses the GPU to perform matrix algebra in parallel. We use `.to(device)` on tensors and models to shift them to the memory on the GPU. The model trains much faster using the huge number of processors on the GPU. You will need to run the notebook on Colab or on an AWS machine to get access to a GPU.
- SGD-minibatch-mnist.ipynb
We have been doing batch gradient descent, meaning that we compute the loss on the complete training set as a means to update the parameters of the model. If we process the training data in chunks rather than a single batch, we call it mini-batch gradient descent, or more commonly stochastic gradient descent (SGD). It is called stochastic because of the imprecision and, hence, randomness introduced by the computation of gradients on a subset of the training data. We tend to get better generalization with SGD; i.e., smaller validation loss.
- data-loaders.ipynb (optional)
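
As a rough illustration of the manual gradient descent step described for intro-regression-training-cars.ipynb, here is a minimal numpy sketch. It is not taken from the notebook: the data is synthetic (a made-up weight-versus-MPG trend standing in for the cars data set), and the learning rate and epoch count are arbitrary assumptions.

```python
import numpy as np

# Synthetic stand-in for the cars data (NOT the real data set):
# a made-up linear trend between weight and MPG, plus noise.
rng = np.random.default_rng(0)
weight = rng.uniform(1500, 5000, size=200)
mpg = 45.0 - 0.007 * weight + rng.normal(0, 1.0, size=200)

# Normalize the feature to zero mean and standard deviation one
x = (weight - weight.mean()) / weight.std()
y = mpg

m, b = 0.0, 0.0        # initial slope and intercept
learning_rate = 0.1    # assumed value, chosen for this toy problem

for epoch in range(200):
    error = (m * x + b) - y
    # Partial derivatives of the mean squared error with respect to m and b
    dm = 2 * np.mean(error * x)
    db = 2 * np.mean(error)
    # Step against the gradient
    m -= learning_rate * dm
    b -= learning_rate * db

final_loss = np.mean((m * x + b - y) ** 2)
print(f"m={m:.3f}, b={b:.3f}, loss={final_loss:.3f}")
```

Because the feature is normalized, a single fixed learning rate works for both parameters; without normalization the loss surface is badly stretched along one axis and the same loop would need a much smaller step.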
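
The claim in pytorch-nn-training-cars.ipynb that two linear models in sequence are effectively one linear model can be checked directly. The sketch below is an illustration under assumed layer sizes (1 input, 5 hidden units, 1 output), not the notebook's code: it collapses two stacked `nn.Linear` layers into one using W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers back to back, with no nonlinearity in between
stacked = nn.Sequential(
    nn.Linear(1, 5),
    nn.Linear(5, 1),
)

W1, b1 = stacked[0].weight, stacked[0].bias   # shapes (5, 1) and (5,)
W2, b2 = stacked[1].weight, stacked[1].bias   # shapes (1, 5) and (1,)

# Build the equivalent single layer from the combined weights and bias
single = nn.Linear(1, 1)
with torch.no_grad():
    single.weight.copy_(W2 @ W1)
    single.bias.copy_(W2 @ b1 + b2)

x = torch.linspace(-3, 3, steps=7).reshape(-1, 1)
with torch.no_grad():
    same = torch.allclose(stacked(x), single(x), atol=1e-6)
print(same)
```

Inserting a nonlinearity such as `nn.ReLU()` between the two layers breaks this equivalence, which is why the nonlinearity is what allows curved fits.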
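
To make the difference between the binary and k-class output layers concrete, here is a small sketch contrasting a sigmoid head with a softmax head. The inputs are random tensors rather than the wine or MNIST data, and the layer sizes (2 features for the binary model; 784 inputs, 50 hidden neurons, and 10 classes for the multiclass model) are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Binary classifier: one output neuron; sigmoid gives P(class 1).
# A single linear layer plus sigmoid is logistic regression.
binary_model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
X_binary = torch.randn(4, 2)             # 4 fake rows with 2 features
p_class1 = binary_model(X_binary)        # shape (4, 1), values in (0, 1)

# k-class classifier: k output neurons, softmax across them
k = 10
multi_model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, k))
X_multi = torch.randn(4, 784)            # 4 fake flattened 28x28 images
logits = multi_model(X_multi)            # shape (4, k), unnormalized scores
probs = torch.softmax(logits, dim=1)     # each row sums to 1
predicted = probs.argmax(dim=1)          # most likely class per row
```

When training with `nn.CrossEntropyLoss`, the model should output the raw logits, since that loss applies the log-softmax internally; the explicit `softmax` here is only for reading off probabilities.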
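
The `.to(device)` pattern mentioned for gpu-mnist.ipynb typically looks like the sketch below. The model architecture and the random input are placeholders, not the notebook's code; the snippet falls back to the CPU when no GPU is present so it runs anywhere.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model; moving it copies its parameters to the device
model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
model = model.to(device)

# Tensors must live on the same device as the model that consumes them
X = torch.randn(32, 784).to(device)
y_pred = model(X)
print(y_pred.device)
```

Mixing devices, for example a CPU tensor fed to a GPU model, raises a runtime error, so both the model and every input tensor need the same `.to(device)` treatment.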
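
The mini-batch idea from SGD-minibatch-mnist.ipynb can be sketched as a loop that updates the parameters once per chunk instead of once per pass over the full training set. This is an illustration on synthetic regression data, not MNIST, and the batch size, learning rate, and epoch count are assumed values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data (a stand-in, not MNIST)
n = 1000
X = torch.randn(n, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ true_w + 0.1 * torch.randn(n, 1)

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
batch_size = 32

for epoch in range(20):
    # Shuffle each epoch so every mini-batch is a fresh random subset
    perm = torch.randperm(n)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        loss = loss_fn(model(X[idx]), y[idx])   # loss on this chunk only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                        # one update per mini-batch

with torch.no_grad():
    full_loss = loss_fn(model(X), y).item()
print(f"full training loss: {full_loss:.4f}")
```

Each gradient here is a noisy estimate of the full-batch gradient, which is the source of the randomness that gives SGD its name. Setting `batch_size = n` recovers the batch gradient descent used in the earlier notebooks.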