Support Vector Machines can be used on classification problems where we want to find the decision boundary that best separates the classes.
Scikit-learn has excellent data set generators. One of them is `make_blobs`, another is `make_moons`. Let's generate four data sets, all of which we'll analyze using support vector machines.
Run the cell below to create and plot some sample data sets.
from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
plt.figure(figsize=(10, 10))
plt.subplot(221)
plt.title("Two blobs")
X, y = make_blobs(n_features = 2, centers = 2, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)
plt.subplot(222)
plt.title("Two blobs with more noise")
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=2.8, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)
plt.subplot(223)
plt.title("Three blobs")
X, y = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)
plt.subplot(224)
plt.title("Two interleaving half circles")
X, y = make_moons(n_samples=100, shuffle = False , noise = 0.3, random_state=123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)
plt.show()
Let's have a look at our first plot again. We'll start with this data set and fit a simple linear support vector machine to it. You can use scikit-learn's `svm.SVC` class to do that!
X, y = make_blobs(n_features = 2, centers = 2, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c = y, s=25)
In the cell below:

- Import the `svm` module from sklearn
- Create a `SVC` object (short for "Support Vector Classifier")
- Fit it to the data we created in the cell above (`X` and `y`)
# from sklearn .... # uncomment and finish this import statement
clf = None
# clf.fit(None, None)
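If you want to check your work, one possible solution looks like this (note that `SVC` defaults to an RBF kernel, so we request a linear one explicitly):

```python
from sklearn.datasets import make_blobs
from sklearn import svm

# Recreate the first data set so this cell is self-contained
X, y = make_blobs(n_features=2, centers=2, random_state=123)

# A linear support vector classifier fit to X and y
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
```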
Let's save the first feature (on the horizontal axis) as X1 and the second feature (on the vertical axis) as X2.
X1 = X[:, 0]
X2 = X[:, 1]
Next, let's store the minimum and maximum values of X1 and X2. We'll add some slack (1) to the min and max boundaries.
# plot the decision function
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1
Let's see if what we just did makes sense. Have a look at your plot and verify the result!
print(X1_max)
Next, we'll create a grid. You can do this by using the numpy function `np.linspace`, which creates a numpy array of evenly spaced numbers over a specified interval. The default number of points is 50, and we don't need that many, so let's specify `num=10` for now. You'll see that you need to increase this number once we get to classifying more than 2 groups.
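As a quick illustration, `np.linspace` with `num=10` returns 10 evenly spaced values, including both endpoints:

```python
import numpy as np

# 10 evenly spaced points from 0 to 9, endpoints included
grid = np.linspace(0, 9, num=10)
print(grid)  # → [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
```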
In the cell below:

- Set each of the following coordinate variables using `np.linspace()`:
  - For `x1_coord`, pass in the appropriate min and max values, along with the number of points, 10.
  - For `x2_coord`, pass in the appropriate min and max values, along with the number of points, 10.
x1_coord = None
x2_coord = None
Now, run the following cells:
X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]
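To see what these two lines produce: `np.meshgrid` turns the two 10-point axes into a pair of 10×10 coordinate matrices, and `np.c_` pairs their flattened values column-wise into 100 (x1, x2) grid points. A small self-contained sketch (using a dummy 0-to-1 range):

```python
import numpy as np

x1_coord = np.linspace(0, 1, 10)
x2_coord = np.linspace(0, 1, 10)

# Every combination of the two axes, as two 10x10 coordinate matrices
X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)

# Flatten and stack them column-wise: one row per grid point
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]
print(x1x2.shape)  # → (100, 2)
```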
Let's now get the coordinates of the decision function. Run the cells below.
# df = clf.decision_function(x1x2).reshape(X1_C.shape)
# plt.scatter(X1, X2, c = y)
# axes = plt.gca()
# axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
# plt.show()
The coordinates of the support vectors can be found in the support_vectors_
attribute:
# clf.support_vectors_
Run the cell below to create your plot again, but with highlighted support vectors.
# plt.scatter(X1, X2, c = y)
# axes = plt.gca()
# axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
# axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue')
# plt.show()
The previous example was pretty easy: the two "clusters" were easily separable by a single straight line that classified every instance correctly. But what if this isn't the case? Let's have a look at the second data set we generated.
Run the cell below to recreate and plot our second dataset from above.
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=2.8, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)
In the cell below, repeat the entire process from above. We're doing the exact same thing as we did above, but to a different dataset--feel free to copy and paste the code you wrote above.
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)
from sklearn import svm
clf = None
# clf.fit(None, None)
X1= None
X2= None
X1_min, X1_max = None, None
X2_min, X2_max = None, None
x1_coord = None
x2_coord = None
X2_C = None
X1_C = None
x1x2 = None
# df = clf.decision_function(x1x2).reshape(X1_C.shape)
# plt.scatter(X1, X2, c = y)
# axes = plt.gca()
# axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
# plt.show()
As you can see, 3 instances are misclassified (1 yellow, 2 purple). This is because in scikit-learn, the `svm` module allows for slack variables by default (a soft margin). If we want as few misclassifications as possible, we should set a bigger value for `C`, the regularization parameter.
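To make the effect of `C` concrete, here is a small comparison (the exact counts depend on the data, but a small `C` tolerates many margin violations and so keeps many support vectors, while a large `C` keeps few):

```python
from sklearn.datasets import make_blobs
from sklearn import svm

# Same noisy two-blob data set as above
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=2.8, random_state=123)

# Small C: wide margin, many violations tolerated, many support vectors
soft = svm.SVC(kernel='linear', C=0.01).fit(X, y)
# Large C: narrow margin, few violations tolerated, few support vectors
hard = svm.SVC(kernel='linear', C=1000000).fit(X, y)

print(len(soft.support_vectors_), len(hard.support_vectors_))
```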
Now, we run the same code again, except with a different value for the C
parameter passed in at initialization for our svc
object. Run the cell below and see how our decision boundaries change.
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=2.8, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)
from sklearn import svm
clf = svm.SVC(kernel='linear', C=1000000)
clf.fit(X, y)
# Start Reusable Section
X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1
x1_coord = np.linspace(X1_min, X1_max, 10)
x2_coord = np.linspace(X2_min, X2_max, 10)
X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]
df = clf.decision_function(x1x2).reshape(X1_C.shape)
plt.scatter(X1, X2, c = y)
axes = plt.gca()
axes.contour(X1_C, X2_C, df, colors= "black", levels= [-1, 0, 1], linestyles=[':', '-', ':'])
plt.show()
# End Reusable Section
We'll now repeat the same process as above, but on our 3rd data set. This data set contains three classes, turning this from a binary classification problem into a multiclass classification problem.
Run the cell below to recreate and plot the 3rd dataset we created at the beginning of this lab.
X, y = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5, random_state = 123)
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)
Now, we'll repeat the same process as we did above. In the cell below:

- Create a `SVC` object. Set the `kernel` to `"linear"`, and `C` to `20`.
- `fit` the model to `X` and `y`.
clf = None
# clf.fit(None, None)
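One possible solution for this cell:

```python
from sklearn.datasets import make_blobs
from sklearn import svm

# Recreate the three-blob data set so this cell is self-contained
X, y = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5, random_state=123)

# Linear kernel with C=20, fit to X and y
clf = svm.SVC(kernel='linear', C=20)
clf.fit(X, y)
```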
Now, run the cell below to plot the decision boundaries for our multiclass dataset.
X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1
x1_coord = np.linspace(X1_min, X1_max, 200)
x2_coord = np.linspace(X2_min, X2_max, 200)
X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]
# Z = clf.predict(x1x2).reshape(X1_C.shape)
# axes = plt.gca()
# axes.contourf(X1_C, X2_C, Z, alpha = 1)
# plt.scatter(X1, X2, c = y, edgecolors = 'k')
# axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue', edgecolors= 'k')
# plt.show()
It would probably be nicer to have non-linear decision boundaries here; let's have a look at that! You can also see how your support vectors change.
Run the cell below. Notice that the only substantial change is to the kernel
parameter--here, we have changed it from linear
to rbf
(for "Radial Basis Function").
X, y = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5, random_state = 123)
clf = svm.SVC(kernel='rbf', C=20)
clf.fit(X, y)
X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1
x1_coord = np.linspace(X1_min, X1_max, 500)
x2_coord = np.linspace(X2_min, X2_max, 500)
X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]
Z = clf.predict(x1x2).reshape(X1_C.shape)
axes = plt.gca()
axes.contourf(X1_C, X2_C, Z, alpha = 1)
plt.scatter(X1, X2, c = y, edgecolors = 'k')
axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue', edgecolors= 'k')
plt.show()
Let's see one more example of using the kernel trick to find non-linear decision boundaries. Run the cell below to create another sample dataset, fit an SVM using a non-linear kernel, and plot the decision boundaries.
As we did in previous examples, we have highlighted our support vectors in blue.
X, y = make_moons(n_samples=100, shuffle = False , noise = 0.3, random_state=123)
clf = svm.SVC(kernel='rbf', C=20)
clf.fit(X, y)
X1= X[:,0]
X2= X[:,1]
X1_min, X1_max = X1.min() - 1, X1.max() + 1
X2_min, X2_max = X2.min() - 1, X2.max() + 1
x1_coord = np.linspace(X1_min, X1_max, 500)
x2_coord = np.linspace(X2_min, X2_max, 500)
X2_C, X1_C = np.meshgrid(x2_coord, x1_coord)
x1x2 = np.c_[X1_C.ravel(), X2_C.ravel()]
Z = clf.predict(x1x2).reshape(X1_C.shape)
axes = plt.gca()
axes.contourf(X1_C, X2_C, Z, alpha = 1)
plt.scatter(X1, X2, c = y, edgecolors = 'k')
axes.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], facecolors='blue', edgecolors= 'k')
plt.show()
- https://machinelearningmastery.com/generate-test-datasets-python-scikit-learn/
- http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html
- https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html
- http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html