This blog tries to peek into the mathematical world of support vector machines(SVM) and is eminently driven from Late Prof. Patrick Winston’s video lecture about learning SVM.
Apart from the mathematical intricacy, the journey of SVM, right from the birth as an idea in Vladimir Vapnik’s mind to it’s implementation is extremely interesting.
What if at the very begining of writing about this serious subject, i compare Machine learning with something such as a Microvave? Definitely the data science community will feel wierd about it.
But In her ~22 mins Ted Talk, Cassie Kozyrkov Chief Decision Scientist at Google.Inc relates Machine learning with Microwaves. The point she wants to make is that why do we trust a Microwave even when we don’t know how to build a Microwave from scratch? Answer is that we take our decision to trust it or not based on what they do for us, checking whether the machines deliver what they need to deliver or not?
In a similar way, the main focus area of applied data science lies on making decisions for applying Machine learning algorithms to the real world problems.
It is evident that how decision making plays an important role on the most external side of Machine learning. Intrestingly, the internal side of ML world, that is the core idea of ML algorithms also revolves around decision boundaries. Simple ideas for doing the same are implemented using NN, KNN, Id trees etc. But,
as Prof. Winston suggests, SVM is a more sophasticated idea and needs to be in the tool bag of every civilized person.
SVM focuses on how to divide up the space using decision boundaries.
In order to understand this sophasticated idea of SVM, some fundamental knowledge of linear algebra and vector spaces is needed, this is provided in the first section of the blog.
Dot Product and Vector Projection:
A geometrical object that possess both magnitude as well as direction is termed as a Vector. As shown in Fig.1, tail end of the vector u is at origin(0,0) and the head is at (3,2). The magnitude of the vector is its length, denoted by ||u||and calculated by square rooting the sum of the squares of its components. The scalar projection is given by the dot product of a vector with a unit vector in that direction. From Fig.1 it is easy to interpret that the projection of u on v is (3,0) — Vector in Green.
A subspace with a dimension one less than that of its ambient space is known as hyperplane. It divides the given data points into two parts. For a 2-dimensional space, hyper plane will be 1D that is a line, in 3D space it would be a 2 dimensional plane.
One can divide the data points in pink and green, as shown in Fig.2. But that’s not the only way. As shown in Fig.3 there are multiple different ways of doing the same. Here begins the quest of finding the best solution, or so to say an optimum hyperplane to separate the given data points.
let us call the Green points as the positive samples and the pink points as the negative samples. The decision boundary lines shown in Fig.4 are drawn with a view that it is the widest street that separates positve samples from the negative samples. This is why it is also known as, “Widest street approach”.
Given the decision boundary, we need to create a decision rule that would use this decision boundary, to help classfiy the future samples correctly. In order to do that, let us assume any vector w of any random length but constrained to be perpendicular to the median line of the street.
Simillarly, let u be an unknown vector pointing towards an unknown sample that we are interested in knowing whether it is on the left hand side of the street of the right hand side of the street.
As discussed above, we know that the projection of vector u on w which is perpendicular to the street will be helpful to identify where the point is lying. The bigger the projection more are the chances of that point being on the positive side.
At this point w and b both are unknowns, only known information that we have is that the w is perpendicular to the median. Simple rule in mathematics is to put more constraints so that we can find the unknowns.
This constraints could be understood in this way that any positive sample will abide the decision function and have value 1 or greater than one, and vice versa for the negative samples. Thus the seperation distance here is from -1 to +1, for all the samples.
This way we not only got rid of two equations and got one simple equation, rather we also found one more constraint suggesting for all points lying on the Gutters (boundaries) the value of decision function will be Zero.
Now that we have enough constraints, let us remind ourselves the main purpose here is to have the street seperating the positive and negative samples “as wide as possible”.
Let’s find out how can we calculate the distance between both the boundaries, that is the width of the street.
As shown in Fig.6, The vector in Pink is pointing towards a negative sample and the vector in Green represents a positive sample. The vector in blue shows the difference between both the vectors. In order to find out the width, as we learned previously we can project the blue vector on w because it is the unit normal . (Remember the constraint for w was that it will always be perpendicular to the median). The magnitude of the Vector in Red will give us the Width of the Street.
Again, our goal is to maximize the width of the streeth, therefore:
Here we fall in a situation where we would like to maximize an expression but keeping in mind the constraints that needs to be fulfilled. Lagrange multipliers are useful in such situation as they provide us a new expression that could be maximized or minimzed without taking care of constraints anymore.
This is the final equation for which we still need to find the extremum, Although that is not in the scope of this blog, our interest still lies in looking at the factor on which the L depends. As it could be seen from the eq.9 the optimisation of the expression solely depends on the dot product of the pairs of sample vectors.
Finally, lets plugin the value of w that we found from the lagrangian into our decision rule. Here we again discover the same thing that the decision rule now depends only on the dot product of the sample vectors. Started with the dot products, ending again with the dot products.
For the linearly non seperable datapoints we have Kernel function that gives us the transformation of the dot products of the vectors in another space in order to find a decision boundary. The trick that makes a support vector classfier, a support vector machine.