Support Vector Classifier (SVC) is a non-parametric classification algorithm based on a geometric representation of linearly separable data. In this article we are going to go through the development of the SVC algorithm step by step. Because of the geometric background of SVC, let's start with data that has only 2 features, so it can be represented in 2D, which makes visualization much easier; all the concepts in 2D will then generalize well to higher dimensions.
The shape of data
Let's say we have tabular data in the following form.
Each row in the table consists of two components: the first component is a point $X_i$ in the space $\mathbb{R}^2$ such that $X_i = (x_{i1}, x_{i2})$, and the second component is the group to which $X_i$ belongs, $y_i \in \{-1, 1\}$.
The data, when plotted in 2D, should be linearly separable, and it would look something like the following plot.
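As a concrete illustration, here is a minimal NumPy sketch that builds such a dataset; the cluster centers, spread, and sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clusters in R^2, shifted apart so they are linearly separable.
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))    # group y = +1
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))  # group y = -1

X = np.vstack([X_pos, X_neg])                    # each row is a point X_i = (x_i1, x_i2)
y = np.concatenate([np.ones(20), -np.ones(20)])  # group labels y_i in {-1, +1}

print(X.shape, y.shape)  # (40, 2) (40,)
```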
Geometrical Data classification
The goal of SVC is to find a line that separates the two groups of data in the best way. We will elaborate on what "best" means here in the upcoming section.
Line equation
It is important to note that the line equation used to describe the separation line we are after is not the commonly used slope-intercept form ($y = mx + c$). The reason is that the slope-intercept form fails to describe a vertical line, which might be a candidate separation line in SVC. Let's take an example: $L_1$ is a vertical line at $x = a$. If we take two points on $L_1$, $p_0 = (a, y_1)$ and $p_1 = (a, y_2)$, then:
$$m = \frac{y_2 - y_1}{a - a} = \frac{y_2 - y_1}{0} = \infty$$
Therefore, we use the general form of the line equation instead, $Ax + By + C = 0$, which can describe a vertical line at $x = a$ by setting $A = 1$, $B = 0$ and $C = -a$. The general form has another helpful characteristic: it tells whether a point falls above or below a line. E.g. if $L$ is a line with the general form $a_1 x + b_1 y = c_1$ and $p = (x_1, y_1)$ is a point, then by plugging $p$ into $L$ we get a value $v = a_1 x_1 + b_1 y_1 - c_1$: $v > 0$ means the point $p$ is above the line $L$, $v < 0$ means $p$ is below $L$, and $v = 0$ means $p$ is on $L$ (the above/below reading assumes $b_1 > 0$).
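A quick sketch of this above/below test in Python; the particular line and points are arbitrary examples:

```python
def side_of_line(a1, b1, c1, p):
    """Evaluate v = a1*x + b1*y - c1 for the line a1*x + b1*y = c1.

    v > 0: p lies above the line, v < 0: below, v == 0: on the line
    (the above/below reading assumes b1 > 0).
    """
    x1, y1 = p
    return a1 * x1 + b1 * y1 - c1

# Line y = x, written in general form as -x + y = 0 (a1 = -1, b1 = 1, c1 = 0).
print(side_of_line(-1, 1, 0, (0, 2)))  # 2  -> above
print(side_of_line(-1, 1, 0, (2, 0)))  # -2 -> below
print(side_of_line(-1, 1, 0, (1, 1)))  # 0  -> on the line
```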
Now let's redraw the previous figure, adding a line between the two clusters, and label the points in the cluster above the line as (+) and the points in the cluster below the line as (-). The line can be expressed in vector notation as $w \cdot x - b = 0$.
The methodology to choose the best separation line
How do we determine the $w$ and $b$ that result in the best separation line, among the many candidates?
The approach of SVC to choosing the best separation line is to find two parallel lines, each of which passes through at least one edge point of one of the groups; the best pair of lines is the one with the largest distance between them (this distance is called the margin). The separation line is then situated halfway between these two parallel lines and parallel to them. This approach is called the widest street approach, and the edge points of each cluster that the parallel lines pass through are called support vectors.
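To make the widest street idea concrete, here is a NumPy sketch that scores candidate separating lines by their street width, using the standard geometric-margin formula (twice the distance from the line to the closest point); the dataset and the two candidate lines are made up for illustration:

```python
import numpy as np

def street_width(w, b, X, y):
    """Width of the street around the line w.x - b = 0 on data (X, y).

    The distance from a point x_i to the line is |w.x_i - b| / ||w||;
    when the line separates the groups, y_i * (w.x_i - b) > 0 for all i,
    and the street width is twice the smallest such distance.
    """
    return 2 * np.min(y * (X @ w - b)) / np.linalg.norm(w)

# Tiny hand-made dataset: two points per group.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Two candidate separating lines.
w1, b1 = np.array([1.0, 1.0]), 0.0  # line x1 + x2 = 0
w2, b2 = np.array([1.0, 0.0]), 0.0  # vertical line x1 = 0

print(street_width(w1, b1, X, y))  # the wider street -> better candidate
print(street_width(w2, b2, X, y))  # the narrower street
```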
Margin width formulation
Now we want to formulate the margin width so we can maximize the distance between the two data groups. To get there, we go through the following steps.
1- The line $w \cdot x - b = 0$ could be written as $w \cdot x = 0$ by including $1$ in the vector $x$ and $-b$ in the vector $w$.
Now we can calculate the distance from the line $w \cdot x - b = 0$ to the line $w \cdot x - b = 1$ or the line $w \cdot x - b = -1$ as a multiple of the unit vector of $w$, starting from a point $x_-$ on the line $w \cdot x - b = 0$. Let's calculate the distance from the line $w \cdot x - b = 0$ to the line $w \cdot x - b = 1$ and denote this distance as $k$. The corresponding point $x_+$ on the line $w \cdot x - b = 1$ is

$$x_+ = x_- + k \frac{w}{||w||}, \quad \text{margin} = 2k$$

$$\because x_+ \text{ is on } w \cdot x - b = 1$$

$$\Rightarrow w \cdot \left( x_- + k \frac{w}{||w||} \right) - b = 1$$

$$w \cdot x_- + k \frac{w \cdot w}{||w||} - b = 1$$

$$w \cdot x_- - b + k \frac{w \cdot w}{||w||} = 1$$

$$\because x_- \text{ is on the line } w \cdot x - b = 0 \Rightarrow w \cdot x_- - b = 0$$

$$\Rightarrow k \frac{||w||^2}{||w||} = 1 \Rightarrow k = \frac{1}{||w||}, \quad \text{margin} = 2k = \frac{2}{||w||}$$

Maximizing the margin $\frac{2}{||w||}$ is therefore equivalent to minimizing $||w||$; we choose to minimize $\frac{1}{2}||w||^2$ for mathematical convenience.
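A quick numeric check of this derivation; the particular $w$ and $b$ are arbitrary example values:

```python
import numpy as np

w = np.array([3.0, 4.0])  # example weight vector, ||w|| = 5
b = 2.0

# A point on the middle line w.x - b = 0: x_minus = b * w / ||w||^2.
x_minus = b * w / np.dot(w, w)

# Step size derived above: k = 1 / ||w||.
k = 1 / np.linalg.norm(w)
x_plus = x_minus + k * w / np.linalg.norm(w)

print(np.dot(w, x_minus) - b)  # ~0, x_minus is on w.x - b = 0
print(np.dot(w, x_plus) - b)   # ~1, x_plus lands on w.x - b = 1
print(2 * k)                   # margin = 2 / ||w|| = 0.4
```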
The Lagrange multipliers
The Lagrange multipliers method is used to fold the constraints, together with the minimization objective, into a single cost function.
The general form of the Lagrange multipliers method is: given a minimization problem $\text{minimize } f(w)$ s.t. $g(w) \le 0$, $h(w) = 0$, then:
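For completeness, the cost function (the Lagrangian) for the general form above, and its standard instantiation for the SVC problem, can be written as follows; here I assume the usual SVC constraints $y_i (w \cdot x_i - b) \ge 1$, which encode that each point lies on the correct side of its margin line:

```latex
\mathcal{L}(w, \alpha, \beta) = f(w) + \alpha \, g(w) + \beta \, h(w)

J(w, b, \alpha) = \frac{1}{2}||w||^2
  - \sum_i \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right],
  \quad \alpha_i \ge 0
```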
The primal problem $J$ can be expressed as $$\min_{w,b} \left\{ \max_{\alpha} \left\{ J(w, b, \alpha) \right\} \right\}$$
The dual problem formulation
Let $p^*$ be the solution of $\min_{w,b}\ \max_{\alpha} J(w, b, \alpha)$. There is an option to minimize $J$ first, then maximize it: $\max_{\alpha}\ \min_{w,b} J(w, b, \alpha)$. This order flip turns the primal Lagrangian problem into what is called the dual problem.
Let $d^*$ be a solution of $\max_{\alpha}\ \min_{w,b} J(w, b, \alpha)$.
Finding the solution $d^*$ is based on two theorems: the weak duality and the strong duality theorems.
The weak duality theorem implies that $d^* \le p^*$.
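Weak duality can be checked numerically on a tiny convex problem; the toy problem below, $f(w) = \frac{1}{2}w^2$ with the constraint $w \ge 1$, is my own illustrative choice and not part of the SVC derivation:

```python
import numpy as np

# Toy problem: minimize f(w) = 0.5 * w^2  subject to  g(w) = 1 - w <= 0.
# Primal optimum: the constraint forces w = 1, so p* = 0.5.
# Lagrangian: J(w, alpha) = 0.5 * w^2 + alpha * (1 - w), alpha >= 0.
# Inner minimization over w: dJ/dw = w - alpha = 0  ->  w = alpha,
# giving the dual function  d(alpha) = alpha - 0.5 * alpha^2.

alphas = np.linspace(0, 3, 3001)
dual = alphas - 0.5 * alphas**2

d_star = dual.max()    # dual optimum over alpha >= 0
p_star = 0.5 * 1.0**2  # primal optimum

print(d_star, p_star)  # d* = p* = 0.5: the bound is tight for this convex problem
```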
The strong duality theorem implies that $d^* = p^*$ if and only if there exists a saddle point of $J(w, b, \alpha)$, and this saddle point satisfies the Karush-Kuhn-Tucker (KKT) conditions, which consist of the following terms: