
CSC412 Computer Science Communication

Student Name: Aparna Gopalakrishnan

Student Number: 1004692941

I am comfortable sharing my submissions with:

  • Peers on Forum
  • Anonymously via Course Twitter: @ProbablyLearn
  • My Personal Twitter: @aparna_gee (which @ProbablyLearn can retweet!)

This submission will consist of memes with short explanations for each.

  1. draft_meme5

Variable Elimination (VE) is an exact inference algorithm on graphical models. It computes the distribution of a query variable given observed variables by summing out the rest:

$$p(X_Q \mid X_E = e) \propto \sum_{X_R} p(X_Q, X_E = e, X_R)$$

where $X_E$ are the observed variables, $X_Q$ is the query variable, and $X_R$ are the remaining variables (neither observed nor queried). The complexity of VE depends on the ordering chosen while summing out, and is exponential in the worst case. Minimum degree ordering is one such 'good' ordering: it aims to reduce storage and computation requirements by reducing the number of non-zero factors in the Cholesky decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose. While there are heuristics for finding a good ordering (like minimum degree ordering), finding the optimal ordering is an NP-hard problem.
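To make the sum-out operation concrete, here is a minimal sketch on a hypothetical chain $A \to B \to C$, eliminating $A$ and then $B$ to obtain $p(C)$ (the probability tables are made-up numbers):

```python
import numpy as np

# Hypothetical chain A -> B -> C with binary variables.
p_a = np.array([0.6, 0.4])                  # p(A)
p_b_given_a = np.array([[0.7, 0.3],         # p(B | A): rows index A, cols index B
                        [0.2, 0.8]])
p_c_given_b = np.array([[0.9, 0.1],         # p(C | B): rows index B, cols index C
                        [0.5, 0.5]])

# Eliminate A: phi(B) = sum_a p(A=a) p(B | A=a)
phi_b = p_a @ p_b_given_a

# Eliminate B: p(C) = sum_b phi(B=b) p(C | B=b)
p_c = phi_b @ p_c_given_b
print(p_c, p_c.sum())                       # a valid distribution over C
```

On a chain the ordering barely matters, but on a general graph each elimination creates a new factor over the neighbours of the eliminated variable, which is exactly the fill-in that a good ordering tries to keep small.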

  2. draft6_meme

Generative Adversarial Networks (GANs) are a generative modelling approach which aims to model the distribution of the data itself. A GAN is composed of a generator network and a discriminator network. The generator takes a random noise vector as input and aims to generate a sample in the input domain by learning the latent parameters that define the data distribution. The discriminator takes a sample point, either a 'real' point from the input domain or one produced by the generator, and tries to differentiate between them, i.e. predicts a binary class label (real/fake). These networks are trained together: the generator generates samples which are given to the discriminator along with real examples. The networks learn by how well they can 'fool' each other: the discriminator is trained to get better at differentiating between real and fake samples, and the generator is updated based on how successfully it fools the discriminator.
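As a minimal sketch of that alternating training loop in PyTorch (the 1-D toy data, layer sizes, and learning rates are illustrative assumptions, not part of the submission):

```python
import torch
import torch.nn as nn

# Toy setup: 'real' data are samples from N(4, 1); the generator maps noise to 1-D samples.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 4 + torch.randn(64, 1)             # samples from the data distribution
    fake = G(torch.randn(64, 8))              # generator samples from noise

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: update G so that D labels its samples as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```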

Mode collapse refers to when the generator produces outputs with little to no diversity that are nonetheless good at fooling the discriminator. One of the loss measures used in training is the (reverse) KL-divergence: for a random variable $x$ with target distribution $p$ and learned distribution $q$,

$$D_{KL}(q \,\|\, p) = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]$$

which encourages the learned distribution to model a single mode of the target distribution, resulting in low-diversity sampling, i.e. mode collapse (Theis et al., 2016).
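A quick numerical illustration of this mode-seeking behaviour, with a made-up bimodal target $p$ and two candidate approximations $q$ evaluated on a grid:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

# Bimodal target p and two candidate approximations q.
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)
q_mode = norm.pdf(x, 3, 1)       # sits on a single mode of p
q_wide = norm.pdf(x, 0, 3.2)     # spreads mass over both modes

def reverse_kl(q, p):
    """D_KL(q || p) ~= sum q log(q/p) dx on the grid."""
    mask = q > 1e-12
    return np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))) * dx

print(reverse_kl(q_mode, p))  # ~log 2 ~= 0.69: sitting on one mode is cheap
print(reverse_kl(q_wide, p))  # larger (~0.96): q is penalized for mass where p is small
```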

  3. mcmc

We are already familiar with intractable integrals/sums when sampling from probability distributions or in optimization. Markov chain Monte Carlo (MCMC) methods like Metropolis–Hastings sampling create chains of samples from a random variable whose pdf is known only up to proportionality (the unnormalized density is tractable even when the normalizing integral is not), and estimate the intractable integral as an expectation using these samples. By the law of large numbers, the more steps included in the chain, the closer the estimate gets.
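A minimal Metropolis–Hastings sketch (the unnormalized bimodal target and the Gaussian random-walk proposal are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def unnormalized_p(x):
    """Target density known only up to a constant: a bimodal example."""
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

x = 0.0
samples = []
for _ in range(50_000):
    proposal = x + rng.normal(scale=1.0)       # symmetric random-walk proposal
    accept_prob = min(1.0, unnormalized_p(proposal) / unnormalized_p(x))
    if rng.random() < accept_prob:             # Metropolis acceptance step
        x = proposal
    samples.append(x)

samples = np.array(samples[5_000:])            # drop burn-in
print(samples.mean(), (samples ** 2).mean())   # Monte Carlo estimates of E[x], E[x^2]
```

Note that the normalizing constant cancels in the acceptance ratio, which is exactly why the method only needs the density up to proportionality.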

MCMC methods were created to handle sampling from higher-dimensional intractable distributions, but they suffer from the curse of dimensionality: the exponential increase in volume with dimension concentrates the majority of the mass of the posterior probability distribution away from its mode, in its typical set. This occurs because the increase in volume dominates the density. MCMC methods try to overcome this difficulty by exploiting the structure of the posterior distribution to concentrate sampling in a small subset of the overall distribution. This can lead to the implicit assumption that the samples are generated from a single mode, i.e. that the distribution does not have multiple modes.
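A small sketch of that concentration effect, using a standard Gaussian as a convenient example: the distance of samples from the mode grows like $\sqrt{d}$, so in high dimension essentially no mass sits near the mode.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    x = rng.standard_normal((10_000, d))   # samples from a d-dimensional N(0, I)
    r = np.linalg.norm(x, axis=1)          # distance of each sample from the mode
    print(d, r.mean(), np.sqrt(d))         # mean distance tracks sqrt(d)
```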

  4. meme8

Parallel WaveNet is a state-of-the-art realistic speech synthesis model which improved the sampling efficiency of the original WaveNet model (by over 1000 times!). The original model is a convolutional autoregressive network (causal/masked, i.e. each output depends only on previous entries) modelling the joint distribution of high-dimensional data as a product of conditional distributions:

$$p(x) = \prod_{t} p(x_t \mid x_1, \ldots, x_{t-1}, \theta)$$

where $\theta$ are the parameters of the model, which receives $x_1, \ldots, x_{t-1}$ as input and outputs a distribution over possible $x_t$.

This structure allows for efficient parallel training, since the network can process its whole input in parallel, but leads to slow, sequential sample generation, since sample $x_{t-1}$ is required for producing $x_t$.

Parallel WaveNet aims to achieve efficiency in both sampling and training by using Inverse Autoregressive Flows (IAFs). IAFs are a type of normalizing flow which model a multivariate distribution as a non-linear transformation $x = f(z)$ of a simple tractable distribution; by the change-of-variables formula, the resulting random variable has log probability:

$$\log p_X(x) = \log p_Z(z) - \log \left|\det \frac{\partial f(z)}{\partial z}\right|$$

The chosen $f$ is invertible, with the determinant of its Jacobian $\frac{\partial f(z)}{\partial z}$ being easy to compute (e.g. a triangular matrix, so the determinant is simply the product of its diagonal entries). Parallel WaveNet is able to achieve efficiency in both sampling and training using probability density distillation:

A fully-trained WaveNet model is used to teach a smaller and parallelized "student" network. In training, the student network is given random noise $z$ as input, to which the following transformation is applied:

$$x_t = z_t \cdot s(z_{<t}; \theta) + \mu(z_{<t}; \theta)$$

producing output sample $x_t$, where $\mu(\cdot)$ and $s(\cdot)$ have the convolutional autoregressive network structure of the original WaveNet.

The student network aims to match the teacher's performance (as opposed to the zero-sum adversarial game in GANs). The aim is to minimize

$$D_{KL}(P_S \,\|\, P_T) = H(P_S, P_T) - H(P_S)$$

where $H(P_S, P_T)$ is the cross-entropy between the student $P_S$ and teacher $P_T$, and $H(P_S)$ is the entropy of the student distribution, which prevents the student from collapsing to the teacher's mode (van den Oord et al.).
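A minimal numpy sketch of the affine IAF transform above and its change-of-variables log-density; the shift and scale networks $\mu$ and $s$ are replaced here by hypothetical fixed functions of $z_{<t}$ purely for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def mu(z_prev):
    """Stand-in for the autoregressive shift network mu(z_{<t})."""
    return 0.5 * np.sum(np.tanh(z_prev))

def s(z_prev):
    """Stand-in for the autoregressive scale network s(z_{<t}); kept positive."""
    return np.exp(0.1 * np.sum(np.tanh(z_prev)))

T = 8
z = rng.standard_normal(T)             # base noise z ~ N(0, I)
x = np.empty(T)
log_det = 0.0
for t in range(T):
    scale = s(z[:t])
    x[t] = z[t] * scale + mu(z[:t])    # x_t = z_t * s(z_{<t}) + mu(z_{<t})
    log_det += np.log(scale)           # Jacobian is triangular: det = product of scales

# Change of variables: log p_X(x) = log p_Z(z) - log|det dx/dz|
log_px = norm.logpdf(z).sum() - log_det
print(x, log_px)
```

Because every $x_t$ depends only on the noise $z$, all outputs can be computed in parallel at sampling time; it is the inverse direction (recovering $z$ from $x$) that is sequential, which is why the student is trained by having the teacher score the student's own samples rather than by maximum likelihood on data.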

  5. extra_meme5

Hidden Markov Models (HMMs) are (directed) graphical models describing a Markov process: a single hidden (unobservable) discrete random variable $Z$ and a discrete observed variable $X$ whose behaviour depends on $Z$; $Z_t$ and $X_t$ denote the values of the random variables at state $t$. Additionally, the probability distribution of $Z_{t+1}$ only depends on $Z_t$, i.e. only on the previous state. HMMs are represented by the transition probabilities $p(Z_{t+1} \mid Z_t)$, the observation probabilities $p(X_t \mid Z_t)$, and the initial state distribution $p(Z_1)$.
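As a sketch of how those three ingredients are used, here is the forward algorithm computing the likelihood of an observation sequence for a hypothetical two-state HMM (the numbers are made up):

```python
import numpy as np

# Hypothetical 2-state HMM: transition A[i, j] = p(Z_{t+1}=j | Z_t=i),
# observation B[i, k] = p(X_t=k | Z_t=i), initial distribution pi.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def forward_likelihood(obs):
    """p(X_1..X_T = obs) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = p(Z_1=i) p(x_1 | Z_1=i)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # sum out the previous state, then emit
    return alpha.sum()

print(forward_likelihood([0, 1, 0]))
```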

A Dynamic Bayesian Network (DBN) is a graphical model representing conditional independencies between a set of sequential/time-series random variables using a directed acyclic graph. HMMs are a special case of DBNs, since they represent a restrictive type of system. For example, DBNs allow more than one hidden variable and can represent continuous random variables as well, not just discrete random variables. DBNs also allow us to extend Markov models with higher-order connections: for example, a connection from $Z_{t-2}$ to $Z_t$, i.e. a dependency spanning more than one time step (Ghahramani 1997).

  6. kl-div-cat

KL-divergence, a measure of the difference between two probability distributions $P$ and $Q$, is defined as

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

It is always non-negative, i.e. $D_{KL}(P \,\|\, Q) \ge 0$, but does not satisfy the symmetry and triangle-inequality properties of a metric; in general,

$$D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$$

However, KL-divergence can be modified to satisfy the symmetry condition as follows:

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)$$

This can be read as the expected information gain about $x$ from discovering which probability distribution it is drawn from, $P$ or $Q$, if both have probability 0.5. This gives the Jensen–Shannon Divergence, which is symmetric.
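A small numerical check of these properties on two made-up discrete distributions:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions on the same support."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

print(kl(p, q), kl(q, p))    # both non-negative, but not equal: asymmetric
print(jsd(p, q), jsd(q, p))  # equal: symmetric by construction
```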

Extra stray memes: Here are some extra memes that I made just for fun (also because I wanted to use more It's Always Sunny meme templates but most of them are too inappropriate):

mode_collapse

extra_meme6

draft_meme4

extra2
