Support Vector Machines — from Theory to Implementation

Praveersinh Parmar
Apr 28, 2021

Support Vector Machine (SVM) is a statistical learning method developed in the computer science community in the 1990s. SVMs are used mainly for classification, although an extension of SVM exists for regression. They are considered "out-of-the-box" classifiers because they perform well in a wide variety of settings. This article introduces the basic concepts of SVMs through the following stepwise approach:

  • Concept of Hyperplane
  • Classification using a Separating Hyperplane
  • Maximal Margin Classifier
  • Support Vector Classifier
  • Support Vector Machines
  • Implementing SVM via the scikit-learn library in Python

This article is aimed at readers who are just beginning to grasp the SVM algorithm, as well as those with an intermediate understanding who want to brush up on the fundamentals.

Concept of Hyperplane

Definition: A hyperplane is a flat subspace of a vector space whose dimension is one less than that of the space itself.

For example:
- In a 2-dimensional space, a hyperplane is a line (a flat 1-dimensional subspace).
- In a 3-dimensional space, a hyperplane is a plane (a flat 2-dimensional subspace).

Mathematical Definition of a Hyperplane:

B₀ + B₁X₁ + B₂X₂ + … + BₚXₚ = 0

where B₀, B₁, …, Bₚ are parameters and X = (X₁, X₂, …, Xₚ) is any point on the hyperplane.

Now, if the point X does not satisfy the above equation, there are two possibilities:

Case 1: B₀ + B₁X₁ + B₂X₂ + … + BₚXₚ > 0, i.e. X lies on one side of the hyperplane.

Case 2: B₀ + B₁X₁ + B₂X₂ + … + BₚXₚ < 0, i.e. X lies on the other side of the hyperplane.

Observations can lie on either side of the hyperplane
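As a quick illustration (not from the original article), here is a minimal NumPy sketch of this sign check; the hyperplane coefficients are made-up values:

```python
import numpy as np

# Hypothetical hyperplane in 2-D: 1 + 2*X1 - 3*X2 = 0 (coefficients chosen arbitrarily)
b0 = 1.0
b = np.array([2.0, -3.0])

def hyperplane_value(x):
    """Evaluate B0 + B1*X1 + ... + Bp*Xp for a point x."""
    return b0 + b @ x

# A positive value means the point falls on one side, a negative value on the other
print(hyperplane_value(np.array([2.0, 1.0])))   # 1 + 4 - 3 =  2 -> Case 1
print(hyperplane_value(np.array([0.0, 2.0])))   # 1 + 0 - 6 = -5 -> Case 2
```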

Classification using a Separating Hyperplane

  • Based on the above two cases, one can use a hyperplane for binary classification; a hyperplane used this way is called a Separating Hyperplane.
  • Suppose we have an n × p data matrix X containing n training observations in p-dimensional space.
  • These observations fall into one of two classes: -1 and 1.
  • We also have a test observation that we want to classify into one of these two classes.
  • In such a case, we can use the separating hyperplane to classify our test observation.
  • One such possible classifier is shown below, where we assign a class to our test observation based on which side of the hyperplane it is located on.
Classification using Separating Hyperplane
  • Also, the farther the test point is from the separating hyperplane, the more confident we can be about our class assignment, and vice versa (a short sketch of this idea follows).
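To make the "distance as confidence" idea concrete, here is an illustrative scikit-learn sketch (not from the original article); the signed score returned by decision_function is proportional to a point's distance from the fitted hyperplane, so its magnitude can be read as a rough confidence:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs stand in for linearly separable classes
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear").fit(X, y)

test_points = np.array([[0.0, 5.0], [2.0, 1.0]])
print(clf.predict(test_points))                     # class = side of the hyperplane
print(np.abs(clf.decision_function(test_points)))   # larger magnitude -> more confidence
```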

Maximal Margin Classifier

  • The Maximal Margin Classifier is a classifier built on a separating hyperplane.
  • To construct it, we choose the separating hyperplane that is farthest from the training observations.
  • Margin: the smallest perpendicular distance from the training observations to the separating hyperplane.
  • The separating hyperplane for which the margin is largest (i.e. the hyperplane with the farthest minimum distance to the training observations) is called the Maximal Margin Hyperplane (or Optimal Separating Hyperplane), and the resulting classifier is called the Maximal Margin Classifier.
  • We can then classify a test observation based on which side of the maximal margin hyperplane it lies on.
Maximal Margin Classifier
  • Support Vectors: In the above figure, three training observations are equidistant from the maximal margin hyperplane (two from the blue class and one from the purple class). These observations lie on the margin and are called Support Vectors, because they "support" the maximal margin hyperplane: if they are moved slightly, the maximal margin hyperplane moves too.
  • Important point to note: the maximal margin hyperplane depends only on the support vectors. Moving any other observation has no effect on the hyperplane, as long as that observation does not cross the boundary set by the margin. Thus, the classifier's predictions are unaffected by changes in observations other than the support vectors (a small scikit-learn sketch of this appears after this list).
  • Optimization Problem of the Maximal Margin Classifier (aka the cost function):

maximize M over B₀, B₁, …, Bₚ,
subject to Σⱼ Bⱼ² = 1, and
yᵢ(B₀ + B₁xᵢ₁ + B₂xᵢ₂ + … + Bₚxᵢₚ) ≥ M for every i = 1, …, n

where M = width of the margin,
x₁, x₂, …, xₙ = the n training observations, and
y₁, y₂, …, yₙ = the associated class labels (each taking one of the two values -1 and 1).

  • The constraint involving M guarantees that each observation will be on the correct side of the hyperplane, provided that M is positive.
  • Overall, we choose B₀, B₁, …, Bₚ so as to maximize M.
  • Drawbacks of Maximal Margin Classifier:
  1. Such a classifier perfectly separates all the training observations and hence may be extremely sensitive to a change in a single observation. In other words, this classifier may overfit the data.
  2. In some cases (like the one in the figure below), no separating hyperplane exists, so the two classes cannot be separated at all: the margin here is a Hard Margin, i.e. it allows no flexibility.
Maximal Margin Classifier won’t work in this case
  • For such cases, we can relax the margin and create a Soft Margin, giving a hyperplane that almost separates the classes. The resulting classifier is called a Soft Margin Classifier or Support Vector Classifier.
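As an illustrative aside (not part of the original article), a hard margin can be approximated in scikit-learn by fitting SVC with a very large penalty C on linearly separable toy data, after which the support vectors can be inspected directly:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data: two well-separated blobs
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

# In scikit-learn, C penalizes margin violations, so a very large C
# approximates the hard-margin (maximal margin) classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only these observations determine the separating hyperplane
print(clf.support_vectors_)   # coordinates of the support vectors
print(clf.n_support_)         # number of support vectors per class
```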

Support Vector Classifier

  • The Support Vector Classifier, like the Maximal Margin Classifier, classifies a test observation depending on which side of the hyperplane it lies on.
  • However, in this type of classifier we choose a hyperplane that allows a few observations to be misclassified or to violate the margin.
  • Here, the margin is called a Soft Margin as it can be violated by some of the observations.
Example of a Support Vector Classifier
  • In the above Support Vector Classifier, a few violations are allowed by the soft margin: observations 1, 2, 7, 8, 9, 11 and 12. Of these, observations 1 and 8 are on the wrong side of the margin, whereas observations 11 and 12 are on the wrong side of the hyperplane.
  • The difference between a Hard Margin and a Soft Margin is apparent from the figure below:
Difference between Hard Margin & Soft Margin
  • Optimization Problem of the Support Vector Classifier:

maximize M over B₀, B₁, …, Bₚ and Є₁, …, Єₙ,
subject to Σⱼ Bⱼ² = 1,
yᵢ(B₀ + B₁xᵢ₁ + B₂xᵢ₂ + … + Bₚxᵢₚ) ≥ M(1 − Єᵢ) for every i = 1, …, n, and
Єᵢ ≥ 0 with Σᵢ Єᵢ ≤ C

where C = a non-negative tuning parameter,
M = width of the margin, and
Є₁, …, Єₙ = slack variables that tell us where the iᵗʰ observation lies:
> If Єᵢ = 0, the iᵗʰ observation lies on the correct side of the margin.
> If Єᵢ > 0, the iᵗʰ observation lies on the wrong side of the margin (it violates the margin).
> If Єᵢ > 1, the iᵗʰ observation lies on the wrong side of the hyperplane.

  • Interpretation of the tuning parameter C: From the above formulation, we see that C bounds the sum of the Єᵢ's, so it determines the number and severity of the violations of the margin (and of the hyperplane) that the classifier will tolerate.
    > As C increases, the margin widens.
    > As C decreases, the margin narrows.
Value of C is decreased from figure on top left to figure on bottom right

> A classifier with a small value of C has low bias and high variance, because the narrow margin is rarely violated and the classifier is therefore fit very closely to the data.
> In contrast, a classifier with a large value of C has high bias and low variance, because the wider margin allows more violations and gives a less close fit. (A brief tuning sketch follows this list.)

  • Because a Support Vector Classifier's decision rule is based only on a small number of training observations (the support vectors), the classifier is robust to the behavior of observations that lie far away from the hyperplane.
  • Disadvantage of the Support Vector Classifier: it performs poorly when the class boundary is non-linear, as shown below:
Support Vector Classifier performs poorly in this case
  • This can be overcome by moving to Support Vector Machines.
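As an illustrative aside (not from the original article), C is normally chosen by cross-validation. One caveat: the C above is a budget for violations (larger C means a wider, more tolerant margin), whereas the C parameter of scikit-learn's SVC is a penalty on violations and works in the opposite direction (larger C means a stricter, narrower margin). A minimal tuning sketch:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A toy binary classification problem
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# scikit-learn's C is a violation penalty: small C -> softer, wider margin;
# large C -> harder, narrower margin. Choose it by 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```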

Support Vector Machines

  • The Support Vector Machine (SVM) is an extension of the Support Vector Classifier that results from enlarging the feature space in a specific way, using kernels.
  • Enlarging the feature space helps accommodate non-linear boundaries between classes. For example, instead of fitting a Support Vector Classifier using the p features X₁, X₂, …, Xₚ, we could fit it using the 2p features X₁, X₁², X₂, X₂², …, Xₚ, Xₚ². The optimization problem is then solved in this enlarged feature space.
  • Other possible ways of enlarging the feature space are:
  1. using higher-order polynomial terms
  2. using interaction terms of the form Xⱼ * Xⱼ’ for j ≠ j’
  3. using other functions of predictors
  • Kernel: A kernel is a function that quantifies the similarity of two observations. For example, the Linear Kernel below quantifies the similarity of a pair of observations using their Pearson (standard) correlation:

K(xᵢ, xᵢ') = Σⱼ xᵢⱼ xᵢ'ⱼ

  • The kernel approach to enlarging the feature space is computationally efficient, because it reduces fitting the classifier to evaluating inner products between pairs of observations.
  • Two types of non-linear kernels are most commonly used in SVMs: polynomial kernels and radial kernels (a short scikit-learn sketch of both appears at the end of this section).

When a Support Vector Classifier is combined with a non-linear kernel, the resulting classifier is known as a Support Vector Machine.

  • Polynomial Kernel: A polynomial kernel of degree d is given by:

K(xᵢ, xᵢ') = (1 + Σⱼ xᵢⱼ xᵢ'ⱼ)ᵈ

where using d > 1 gives a much more flexible decision boundary than a linear kernel.

SVM with a Polynomial Kernel of degree 3
  • Note that a Support Vector Classifier is nothing but an SVM using a polynomial kernel of degree d = 1.
  • Radial Kernel: The radial kernel (also called the Radial Basis Function (RBF) kernel) is another popular non-linear kernel used in SVMs. It takes the form:

K(xᵢ, xᵢ') = exp(−γ Σⱼ (xᵢⱼ − xᵢ'ⱼ)²)

where γ is a positive constant.

SVM with a radial kernel
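As an illustrative sketch (not from the original article), both non-linear kernels are available through scikit-learn's SVC; here they are fitted to a toy data set with a non-linear class boundary (all parameter values are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a non-linear class boundary
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Polynomial kernel of degree 3 vs. radial (RBF) kernel
poly_svm = SVC(kernel="poly", degree=3).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X_train, y_train)

print("poly accuracy:", poly_svm.score(X_test, y_test))
print("rbf accuracy:", rbf_svm.score(X_test, y_test))
```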

Implementing SVM via the scikit-learn library in Python

  • Now that we have understood SVMs and how they work, let us take a data set and build an SVM classifier on it (a consolidated script of the steps below is sketched after this list).
  • Here we generate a random binary classification problem using scikit-learn's make_classification().
  • The task is to assign each observation to one of the two class labels: 1 and 0.
  • Next, we split our dataset into train and test sets.
  • Now, we instantiate a Support Vector Classifier and fit our training set to it.
  • Note that SVC() uses the radial kernel (rbf) by default. To use a polynomial kernel instead, simply set the kernel parameter: kernel = 'poly'.
  • After that, we make predictions on our test set.
  • Finally, we evaluate the performance of our classifier using various metrics and confusion matrix.
  • Thus, we have made predictions using our Support Vector Classifier with an accuracy of 88%.
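Because the original code snippets were embedded as images, the following is a minimal reconstruction of the steps described above; the data set size, random seeds, and test split are assumptions, so the resulting accuracy may differ from the 88% reported:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 1. Generate a random binary classification problem (parameters are assumptions)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=42)

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# 3. Instantiate a Support Vector Classifier (RBF kernel by default) and fit it
clf = SVC()  # use SVC(kernel="poly") for a polynomial kernel
clf.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = clf.predict(X_test)

# 5. Evaluate with accuracy, the confusion matrix, and a full classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```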

This was a broad overview of the basic concepts of Support Vector Machines.

