Support Vector Machines Lab

1. First things first: generate a dataset in scikit-learn

Support Vector Machines can be used on problems with an $n$-dimensional feature space. For teaching purposes, however, a 2-dimensional feature space is much more intuitive, because you can see exactly what is going on when you fit a support vector machine.

Scikit-learn has excellent dataset generators; two of them are make_blobs and make_moons. Let's generate four datasets, all of which we'll analyze using support vector machines.

Run the cell below to create and plot some sample data sets.

from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
%matplotlib inline  
import numpy as np

plt.figure(figsize=(10, 10))

plt.subplot(221)
plt.title("Two blobs")
X, y = make_blobs(n_features = 2, centers = 2, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)

plt.subplot(222)
plt.title("Two blobs with more noise")
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=2.8,  random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)

plt.subplot(223)
plt.title("Three blobs")
X, y = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5,  random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)

plt.subplot(224)
plt.title("Two interleaving half circles")
X, y = make_moons(n_samples=100, shuffle = False , noise = 0.3, random_state=123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)

plt.show()

2. Two groups

2.1 Max margin

Let's have a look at our first plot again. We'll start with this dataset and fit a simple linear support vector machine to the data. You can use the scikit-learn class svm.SVC to do that!

X, y = make_blobs(n_features = 2, centers = 2, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)

In the cell below:

  • Import the svm module from sklearn
  • Create an SVC object (short for "Support Vector Classifier")
  • Fit it to the data we created in the cell above (X and y).
# from sklearn .... # uncomment and finish this import statement

clf = None
# clf.fit(None, None)
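
If you get stuck, here is a minimal sketch of one way to complete the cell above. It assumes a linear kernel, which is what this section works toward:

from sklearn import svm

# fit a linear support vector classifier to the blob data
clf = svm.SVC(kernel='linear')
clf.fit(X, y)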

Let's save the first feature (on the horizontal axis) as X1 and the second feature (on the vertical axis) as X2.

X1= X[:,0]
X2= X[:,1]

Next, let's store the minimum and maximum values of X1 and X2. We'll add some padding (1 unit) beyond the min and max boundaries so the plot has a little room around the data.

# plot the decision function
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1

Let's see if what we just did makes sense. Have a look at your plot and verify the result!

print(X1_max)

Next, we'll create a grid. You can do this by using the numpy function linspace, which creates a numpy array with evenly spaced numbers over a specified interval. The default number of points is 50 and we don't need that many, so let's specify num = 10 for now. You'll see that you need to increase this number once we get to classifying more than 2 groups.

In the cell below:

  • Set each of the following coordinate variables using np.linspace().
    • For x1_coord, pass in the appropriate min and max values for X1, along with num = 10.
    • For x2_coord, pass in the appropriate min and max values for X2, along with num = 10.
x1_coord = None
x2_coord = None
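
One possible way to fill in this cell, mirroring the reusable plotting code that appears later in this lab:

# 10 evenly spaced points spanning each feature's (padded) range
x1_coord = np.linspace(X1_min, X1_max, 10)
x2_coord = np.linspace(X2_min, X2_max, 10)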

Now, run the following cells:

X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]

Let's now evaluate the decision function over the grid and plot the decision boundary and margins. Run the cells below.

# df = clf.decision_function(x1x2).reshape(X1_C.shape)
# plt.scatter(X1, X2, c = y)
# axes = plt.gca()
# axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
# plt.show()

The coordinates of the support vectors can be found in the support_vectors_ attribute:

# clf.support_vectors_

Run the cell below to create your plot again, but with highlighted support vectors.

# plt.scatter(X1, X2, c = y)
# axes = plt.gca()
# axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
# axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue') 
# plt.show()

2.2 Soft margin tuning

The previous example was pretty easy. The two "clusters" could be separated by a single straight line that classified every instance correctly. But what if this isn't the case? Let's have a look at the second dataset we generated:

Run the cell below to recreate and plot our second dataset from above.

X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=2.8,  random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)

In the cell below, repeat the entire process from above. We're doing exactly the same thing as before, just on a different dataset, so feel free to copy and paste the code you wrote above.

plt.scatter(X[:, 0], X[:, 1], c=y, s=25)

from sklearn import svm

clf = None
# clf.fit(None, None)

X1= None
X2= None
X1_min, X1_max = None, None
X2_min, X2_max = None, None

x1_coord = None
x2_coord = None

X2_C = None
X1_C = None

x1x2 = None

# df = clf.decision_function(x1x2).reshape(X1_C.shape)

# plt.scatter(X1, X2, c = y)
# axes = plt.gca()
# axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
# plt.show()
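
If you'd rather check your work than copy and paste, here is a sketch of how the completed cell might look. It assumes a linear kernel with scikit-learn's default regularization strength (C=1.0), so the soft margin tolerates the few misclassified points discussed below:

from sklearn import svm

# fit a linear SVC with the default C, allowing some slack
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

X1 = X[:, 0]
X2 = X[:, 1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1

x1_coord = np.linspace(X1_min, X1_max, 10)
x2_coord = np.linspace(X2_min, X2_max, 10)

X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]

# decision function values over the grid, reshaped for contour plotting
df = clf.decision_function(x1x2).reshape(X1_C.shape)

plt.scatter(X1, X2, c=y)
axes = plt.gca()
axes.contour(X1_C, X2_C, df, colors="black", levels=[-1, 0, 1], linestyles=[':', '-', ':'])
plt.show()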

As you can see, 3 instances are misclassified (1 yellow, 2 purple). The reason is that scikit-learn's svm module automatically allows for slack variables. If we want as few misclassifications as possible, we should set a bigger value for C, the regularization parameter.

Now, we run the same code again, except with a different value for the C parameter passed in when initializing our SVC object. Run the cell below and see how the decision boundaries change.

X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=2.8,  random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)

from sklearn import svm

clf = svm.SVC(kernel='linear', C=1000000) 
clf.fit(X, y)
# Start Reusable Section
X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1

x1_coord = np.linspace(X1_min, X1_max, 10)
x2_coord = np.linspace(X2_min, X2_max, 10)

X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)

x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]

df = clf.decision_function(x1x2).reshape(X1_C.shape)

plt.scatter(X1, X2, c = y)
axes = plt.gca()
axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
plt.show()
# End Reusable Section

3. More than 2 groups

We'll now repeat the same process as above, but on our 3rd dataset. This dataset contains 3 classes, turning this from a binary classification problem into a multiclass classification problem.

Run the cell below to recreate and plot the 3rd dataset we created at the beginning of this lab.

X, y = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5,  random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)

Now, we'll repeat the same process as we did above. In the cell below:

  • Create an SVC object. Set the kernel to "linear", and C to 20.
  • Fit the model to X and y.
clf = None
# clf.fit(None, None)
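
For reference, a possible solution for this cell, using the kernel and C value specified in the instructions above:

# linear kernel with a relatively large C, as instructed
clf = svm.SVC(kernel='linear', C=20)
clf.fit(X, y)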

Now, run the cell below to plot the decision boundaries for our multiclass dataset.

X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1

x1_coord = np.linspace(X1_min, X1_max, 200)
x2_coord = np.linspace(X2_min, X2_max, 200)

X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)

x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]

# Z = clf.predict(x1x2).reshape(X1_C.shape)

# axes = plt.gca()
# axes.contourf(X1_C, X2_C, Z, alpha = 1)
# plt.scatter(X1, X2, c = y, edgecolors = 'k')
# axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue', edgecolors= 'k') 
# plt.show()

4. The kernel trick

It would probably be nicer to have non-linear decision boundaries here, so let's have a look at that! You can also see how the support vectors change.

Run the cell below. Notice that the only substantial change is the kernel parameter: we have changed it from 'linear' to 'rbf' (short for "Radial Basis Function").

X, y = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5,  random_state = 123)

clf = svm.SVC(kernel='rbf', C=20) 
clf.fit(X, y)

X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1

x1_coord = np.linspace(X1_min, X1_max, 500)
x2_coord = np.linspace(X2_min, X2_max, 500)

X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)

x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]

Z = clf.predict(x1x2).reshape(X1_C.shape)

axes = plt.gca()
axes.contourf(X1_C, X2_C, Z, alpha = 1)
plt.scatter(X1, X2, c = y, edgecolors = 'k')
axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue', edgecolors= 'k') 
plt.show()

Let's see one more example of using the kernel trick to find non-linear decision boundaries. Run the cell below to create another sample dataset, fit an SVM using a non-linear kernel, and plot the decision boundaries.

As we did in previous examples, we have highlighted our support vectors in blue.

X, y = make_moons(n_samples=100, shuffle = False , noise = 0.3, random_state=123)

clf = svm.SVC(kernel='rbf', C=20) 
clf.fit(X, y)

X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1

x1_coord = np.linspace(X1_min, X1_max, 500)
x2_coord = np.linspace(X2_min, X2_max, 500)

X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)

x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]

Z = clf.predict(x1x2).reshape(X1_C.shape)

axes = plt.gca()
axes.contourf(X1_C, X2_C, Z, alpha = 1)
plt.scatter(X1, X2, c = y, edgecolors = 'k')
axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue', edgecolors= 'k') 
plt.show()

Sources

https://machinelearningmastery.com/generate-test-datasets-python-scikit-learn/

http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html

https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html

http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
