
The Curse of Dimensionality - Lab

Introduction

In this lab, you'll conduct some mathematical simulations to further investigate consequences of the curse of dimensionality.

Objectives

You will be able to:

  • Define a Euclidean distance function for n-dimensional space
  • Plot a graph displaying how sparsity increases with n for n-dimensional spaces
  • Demonstrate how training time grows rapidly as the number of features increases for supervised learning algorithms

Sparseness in n-Dimensional Space

As discussed, points in n-dimensional space become increasingly sparse as the number of dimensions increases. To demonstrate this, you'll write a function to calculate the Euclidean distance between two points. From there, you'll generate random points in n-dimensional space, calculate their average distance from the origin, and plot the relationship between this average distance and n.

Euclidean Distance

To start, write a function which takes two points, p1 and p2, and returns the Euclidean distance between them. Recall that the Euclidean distance between two points is given by:

$$ d(a,b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2} $$

import numpy as np
def euclidean_distance(p1, p2):
    #Your code here
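
A minimal sketch of one possible implementation is shown below; it assumes both points are array-like objects of equal length and uses NumPy for the arithmetic.

def euclidean_distance(p1, p2):
    # Convert to arrays so the function accepts lists, tuples, or arrays
    p1, p2 = np.array(p1), np.array(p2)
    # Sum the squared coordinate-wise differences, then take the square root
    return np.sqrt(np.sum((p1 - p2)**2))

euclidean_distance([0, 0], [3, 4])  # 5.0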

Average Distance From the Origin

To examine the curse of dimensionality further, you'll investigate the average distance to the origin of n-dimensional space. As you'll see, this average distance increases as the number of dimensions increases. To investigate this, generate 100 random points for each n-dimensional space from n=1 to n=1000. Construct each of the 100 points using a random number between -10 and 10 for every dimension of the point. From there, calculate the average distance from these points to the origin. Finally, plot this relationship on a graph: the x-axis will be the number of dimensions, n, and the y-axis will be the average distance from the origin.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
#Your code here
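
One way to run this simulation is sketched below; it assumes the euclidean_distance function defined above and uses 100 points per value of n, as described.

# Sketch: average distance from the origin for n = 1 to 1000 dimensions
avg_distances = []
dimensions = range(1, 1001)
for n in dimensions:
    # 100 random points, each coordinate drawn uniformly from [-10, 10]
    points = np.random.uniform(-10, 10, size=(100, n))
    origin = np.zeros(n)
    distances = [euclidean_distance(p, origin) for p in points]
    avg_distances.append(np.mean(distances))

plt.figure(figsize=(10, 6))
plt.plot(dimensions, avg_distances)
plt.xlabel('Number of Dimensions (n)')
plt.ylabel('Average Distance from the Origin')
plt.title('Sparsity in n-Dimensional Space')
plt.show()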

Convergence Time

As you've heard, another issue with an increasing feature space is the training time required to fit a machine learning model. While additional features can improve predictive results, they also substantially increase training time. To demonstrate this, generate lists of random numbers as you did above. Then, use these lists of random numbers as features in a mock dataset: choose an arbitrary coefficient for each feature, multiply each feature vector by its coefficient, and sum these feature-coefficient products to get an output, y. To spice things up (and avoid a completely deterministic relationship), add a normally distributed white noise term to your output values. Fit an ordinary least squares model to your generated mock data. Repeat this for a varying number of features, and record the time required to fit the model. (Be sure to record only the time to train the model, not the time to generate the data.) Finally, plot the number of features, n, versus the training time for the corresponding model.

import pandas as pd
import datetime
from sklearn.linear_model import LinearRegression, Lasso
#Your code here
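
A sketch of one possible experiment follows; the sample size of 1,000 rows and the step size over n are arbitrary choices, and only the call to fit is timed.

# Sketch: OLS training time as the number of features grows
ols_times = []
feature_counts = range(1, 1001, 10)
for n in feature_counts:
    # Mock dataset: random features, arbitrary coefficients, plus white noise
    X = np.random.uniform(-10, 10, size=(1000, n))
    coeffs = np.random.uniform(-5, 5, size=n)
    y = X.dot(coeffs) + np.random.normal(0, 1, size=1000)

    ols = LinearRegression()
    start = datetime.datetime.now()  # time only the model fit, not the data generation
    ols.fit(X, y)
    elapsed = datetime.datetime.now() - start
    ols_times.append(elapsed.total_seconds())

plt.figure(figsize=(10, 6))
plt.plot(feature_counts, ols_times)
plt.xlabel('Number of Features (n)')
plt.ylabel('Training Time (seconds)')
plt.title('OLS Training Time vs. Number of Features')
plt.show()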

Repeat the Same Experiment for a Lasso Penalized Regression Model

#Your code here
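
A sketch of the same experiment with a Lasso model is shown below; it reuses the feature_counts range from the OLS sketch and assumes scikit-learn's default regularization strength.

# Sketch: repeat the timing experiment with Lasso regression
lasso_times = []
for n in feature_counts:
    X = np.random.uniform(-10, 10, size=(1000, n))
    coeffs = np.random.uniform(-5, 5, size=n)
    y = X.dot(coeffs) + np.random.normal(0, 1, size=1000)

    lasso = Lasso()
    start = datetime.datetime.now()  # again, time only the fit
    lasso.fit(X, y)
    lasso_times.append((datetime.datetime.now() - start).total_seconds())

plt.figure(figsize=(10, 6))
plt.plot(feature_counts, lasso_times)
plt.xlabel('Number of Features (n)')
plt.ylabel('Training Time (seconds)')
plt.title('Lasso Training Time vs. Number of Features')
plt.show()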

Optional: Show Just How Slow it Can Go!

If you're up for putting your computer through the wringer and are patient enough to wait for the necessary computations, try increasing the maximum n from 1000 to 10,000 using Lasso regression. You should see an interesting pattern unfold. See if you can make any hypotheses as to why this might occur!

#Your code here

Summary

In this lab, you conducted various simulations to investigate the curse of dimensionality. This demonstrated some of the caveats of working with large datasets with an increasing number of features. With that, the next section will start to explore Principal Component Analysis, a means of reducing the number of features in a dataset while preserving as much information as possible.
