Mathematical Formulation of the Support Vector Classifier (SVC)

Written by Ahmed Mirghani on πŸ“… March 16, 2024

πŸ“– ~ 8 min read


Introduction

Support Vector Classifier (SVC) is a non-parametric classification algorithm based on a geometric representation of linearly separable data. In this article we are going to go through the development of the SVC algorithm step by step. Because of the geometric background of SVC, let’s start with data that has only 2 features, so it can be represented in 2D, which makes visualization much easier; all the concepts in 2D then generalize well to higher dimensions.

The shape of data

Let’s say we have tabular data in the following form.

SVM data

Each row in the table consists of two components: the first is a point $X_{i}$ in the space $\mathbb{R}^{2}$ such that $X_{i} = (x_{i1}, x_{i2})$, and the second is the group to which $X_{i}$ belongs, $y_{i} \in \{-1, 1\}$.
The data should be linearly separable when plotted in 2D, and it would look something like the following plot.

Data scatter

Geometrical Data classification

The goal of SVC is to figure out a line that separates the two groups of data in the best way. We will elaborate on what this best way means in the upcoming section.

Line equation

It is important to note that the line equation used to describe the separation line we are after is not the commonly used slope-intercept form ($y = mx + c$). The reason is that the slope-intercept form fails to describe a vertical line, which might be a candidate separation line in SVC. Let’s take an example.
$L_{1}$ is a vertical line at $x = a$. If we take two points on $L_{1}$, $p_{0} = (a, y_{1})$ and $p_{1} = (a, y_{2})$, then:

$$m = \frac{y_{2} - y_{1}}{a - a} = \frac{y_{2} - y_{1}}{0} = \infty$$

Therefore, we use the general form of the line equation instead, $Ax + By + C = 0$, which can describe a vertical line at $x = a$ by setting $A = 1$, $B = 0$ and $C = -a$. The general form has another helpful characteristic: it tells whether a point falls above or below a line. E.g., if $L$ is a line with the general form $a_{1}x + b_{1}y = c_{1}$ and $p = (x_{1}, y_{1})$ is a point, then by plugging $p$ into $L$ we get a value $v = a_{1}x_{1} + b_{1}y_{1} - c_{1}$: $v > 0$ means the point $p$ is above the line $L$, $v < 0$ means $p$ is below $L$, and $v = 0$ means $p$ is on $L$.
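This sign test can be sketched in a few lines of Python (the function name `side_of_line` is mine, not from the article):

```python
# Classify a point's position relative to the line Ax + By + C = 0
# using the sign of v = Ax + By + C.

def side_of_line(A, B, C, point):
    """Return 'above' if v > 0, 'below' if v < 0, 'on' if v == 0."""
    x, y = point
    v = A * x + B * y + C
    if v > 0:
        return "above"
    if v < 0:
        return "below"
    return "on"

# The line x + y - 3 = 0 (A=1, B=1, C=-3):
print(side_of_line(1, 1, -3, (2, 2)))  # v = 1  -> "above"
print(side_of_line(1, 1, -3, (1, 1)))  # v = -1 -> "below"

# A vertical line x = 2 (A=1, B=0, C=-2) is expressible here,
# unlike in slope-intercept form; "above"/"below" then means right/left.
print(side_of_line(1, 0, -2, (2, 5)))  # v = 0  -> "on"
```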
Now let’s redraw the previous figure, adding a line between the two clusters, labeling the points in the cluster above the line as (+) and the points in the cluster below the line as (-). The line can be expressed in vector notation as $wx - b = 0$.

SVM line plot

The methodology to choose the best separation line

How do we determine the $w$ and $b$ that result in the best separation line, among many line candidates?

SVM Line candidates

The approach of SVC to choosing the best separation line is to find two parallel lines, each passing through at least one edge point of one of the groups, such that the pair has the largest possible distance between them (this distance is called the margin); the separation line is then situated midway between these two parallel lines, parallel to them. This approach is called the widest street approach, and the edge points of each cluster that the parallel lines pass through are called support vectors.
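As a concrete sketch of this idea, the snippet below fits scikit-learn’s `SVC` with a linear kernel and a large `C` to approximate the hard-margin classifier described here (the toy points are made up; note that scikit-learn writes the line as $wx + b = 0$ rather than $wx - b = 0$):

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up, linearly separable clusters in 2D
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVC of this article
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # separating line: w.x + b = 0
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width =", 2.0 / np.linalg.norm(w))
```

The support vectors reported by the solver are exactly the edge points the two parallel lines pass through.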

support vectors

Margin width formulation

Now we want to formulate the margin width so we can maximize the distance between the two data groups. To get there, we go through the following steps.
1- The line $wx - b = 0$ can be written as $wx = 0$ by including $1$ in the vector $x$ and $-b$ in the vector $w$.

$$wx - b = \begin{bmatrix} w_{1} & w_{2} & -b \end{bmatrix} \begin{bmatrix} x_{1}\\ x_{2}\\ 1 \end{bmatrix} = wx = 0$$

$wx = 0 \Rightarrow w$ is perpendicular to $x$

Vector w perpendicular to wx-b=0

Now we can calculate the distance from the line $wx - b = 0$ to the line $wx - b = 1$ or the line $wx - b = -1$ as a multiple of the unit vector of $w$, starting from a point $x_{*}$ on the line $wx - b = 0$. Let’s calculate the distance from the line $wx - b = 0$ to the line $wx - b = 1$ and denote this distance as $k$. The corresponding point $x_{+}$ on the line $wx - b = 1$ is

$$x_{+} = x_{*} + k \frac{w}{||w||}, \qquad \text{margin} = 2k$$

$\because x_{+}$ is on $wx - b = 1$

$$\Rightarrow \qquad w\left(x_{*} + k \frac{w}{||w||}\right) - b = 1$$

$$wx_{*} + k \frac{ww}{||w||} - b = 1$$

$$wx_{*} - b + k \frac{ww}{||w||} = 1$$

$\because x_{*}$ is on the line $wx - b = 0 \rightarrow wx_{*} - b = 0$

$$\Rightarrow \qquad wx_{*} - b + k \frac{ww}{||w||} = k \frac{ww}{||w||} = 1$$

$$\therefore k = \frac{||w||}{ww} = \frac{||w||}{||w||^{2}} = \frac{1}{||w||}$$

$$k = \frac{1}{||w||}, \qquad \text{margin} = 2k = \frac{2}{||w||}$$
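A quick numeric check of this result, using a made-up weight vector:

```python
import numpy as np

# Hypothetical weight vector for the line wx - b = 0 (values are made up)
w = np.array([3.0, 4.0])          # ||w|| = 5
k = 1.0 / np.linalg.norm(w)       # distance from wx-b=0 to wx-b=1
margin = 2.0 * k                  # total street width

print(k, margin)  # 0.2 0.4
```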

SVM Full picture

Margin width maximization

Now that we have a measure of the margin, we maximize it as the SVC approach instructs.

$$\text{maximize}\left(\frac{2}{||w||}\right) \quad \text{s.t.}$$

$$wx_{i} - b \geq 1 \ \text{if } y_{i} = 1, \qquad wx_{i} - b \leq -1 \ \text{if } y_{i} = -1$$

We can merge the two constraints into one as follows:

$$wx_{i} - b \geq 1 \ \text{if } y_{i} = 1, \qquad wx_{i} - b \leq -1 \ \text{if } y_{i} = -1$$
$$\Rightarrow \qquad y_{i}(wx_{i} - b) \geq 1, \qquad y_{i} \in \{1, -1\}$$

$$\Rightarrow \text{maximize}\left(\frac{2}{||w||}\right), \qquad \text{s.t.} \qquad y_{i}(wx_{i} - b) \geq 1$$

$$\text{maximize}\left(\frac{2}{||w||}\right) \equiv \text{minimize}(||w||)$$

$$\text{minimize}(||w||) \equiv \text{minimize}\left(\frac{1}{2}||w||^{2}\right)$$

we choose to minimize 12∣∣w∣∣2\frac{1}{2}||w||^2 for mathematical convenience.

The Lagrangian multipliers

The Lagrangian multipliers method is used to fold the constraints of a minimization problem into a single cost function.
In general, given a minimization problem: minimize $f(w)$ s.t. $g_{i}(w) \leq 0$, $h_{i}(w) = 0$, then:

$$J = f(w) + \sum^{n}_{i=1} \alpha_{i}g_{i}(w) + \sum^{l}_{i=1} \beta_{i}h_{i}(w)$$

$$\because y_{i}(wx_{i} - b) \geq 1 \rightarrow \quad 1 - y_{i}(wx_{i} - b) \leq 0$$
$$\rightarrow g_{i}(w) = 1 - y_{i}(wx_{i} - b), \quad f(w) = \frac{1}{2}||w||^{2}$$

We don’t have an equality constraint $\rightarrow \quad \sum^{l}_{i=1} \beta_{i}h_{i}(w) = 0$

$$\Rightarrow J = \frac{1}{2}||w||^{2} + \sum^{n}_{i=1} \alpha_{i}\{1 - y_{i}(wx_{i} - b)\}, \qquad \text{s.t.} \quad \alpha_{i} \geq 0$$

$$= \frac{1}{2}||w||^{2} - \sum^{n}_{i=1} \alpha_{i}\{y_{i}(wx_{i} - b) - 1\}, \qquad \text{s.t.} \quad \alpha_{i} \geq 0$$

From a lemma:

$$\max_{\alpha} \ J(w, b, \alpha) = \begin{cases} f(w), & \text{satisfies all the constraints}\\ \infty, & \text{otherwise} \end{cases}$$

$$\Rightarrow \max_{\alpha} \ J(w, b, \alpha) = \begin{cases} \frac{1}{2}||w||^{2}, & 1 - y_{i}(wx_{i} - b) \leq 0 \quad \forall i = 1, ..., n\\ \infty, & \text{otherwise} \end{cases}$$

$\Rightarrow$ The primal problem $J$ can be expressed as $\min_{w, b} \{\ \max_{\alpha} \{\ J(w, b, \alpha)\ \}\ \}$

The dual problem formulation

Let $p^{*}$ be the solution of $\min_{w, b} \ \max_{\alpha} J(w, b, \alpha)$. There is an option to minimize $J$ first and then maximize it, $\max_{\alpha} \ \min_{w, b} J(w, b, \alpha)$; this order flip turns the primal Lagrangian problem into what is called the dual problem.

Let $d^{*}$ be a solution of $\max_{\alpha} \ \min_{w, b} J(w, b, \alpha)$.

Finding the solution $d^{*}$ is based on two theorems: the weak duality and the strong duality theorems.

The weak duality theorem implies that $d^{*} \leq p^{*}$.

The strong duality theorem implies that $d^{*} = p^{*}$ iff there exists a saddle point of $J(w, b, \alpha)$, and this saddle point satisfies the Karush-Kuhn-Tucker (KKT) conditions, which consist of the following terms:

  • $\frac{\partial J(w, b, \alpha)}{\partial w_{i}} = 0, \quad \forall \ i = 1, 2, ..., k$

  • $\alpha_{i}g_{i}(w) = 0, \quad \forall \ i = 1, 2, ..., n$

  • $\alpha_{i} \geq 0, \quad \forall \ i = 1, 2, ..., n$

$$\frac{\partial J}{\partial w} = w - \sum^{n}_{i=1}\alpha_{i}y_{i}x_{i} = 0 \quad \Rightarrow \quad w = \sum^{n}_{i=1}\alpha_{i}y_{i}x_{i}$$

$$\frac{\partial J}{\partial b} = \sum^{n}_{i=1}\alpha_{i}y_{i} = 0$$
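These two conditions can be checked numerically: scikit-learn stores $y_{i}\alpha_{i}$ for the support vectors in `dual_coef_`, so on made-up toy data $w = \sum \alpha_{i}y_{i}x_{i}$ reduces to a single matrix product, and the dual coefficients should sum to zero:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [4.5, 3.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors,
# so w = sum_i alpha_i y_i x_i is one matrix product:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_

print(w_from_dual)   # should match clf.coef_
print(clf.dual_coef_.sum())  # sum alpha_i y_i, should be ~0
```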

Let’s now expand the function JJ and substitute the terms we got from the derivatives above.

$$J = \frac{1}{2}||w||^{2} - \sum^{n}_{i=1} \alpha_{i}\{y_{i}(wx_{i} - b) - 1\}$$

$$= \frac{1}{2}w \cdot w - \sum\alpha_{i}y_{i}wx_{i} + b\sum\alpha_{i}y_{i} + \sum \alpha_{i}$$

$$\because \sum^{n}_{i=1}\alpha_{i}y_{i} = 0 \ \Rightarrow \qquad b\sum\alpha_{i}y_{i} = 0$$

$$\therefore J = \frac{1}{2}w \cdot w - \sum\alpha_{i}y_{i}wx_{i} + \sum \alpha_{i}$$

By substituting $w = \sum^{n}_{i=1}\alpha_{i}y_{i}x_{i}$ in $J$, we get

$$\Rightarrow J = \frac{1}{2}\sum\sum \alpha_{i}\alpha_{j}y_{i}y_{j}x^{T}_{i}x_{j} - \sum\sum \alpha_{i}\alpha_{j}y_{i}y_{j}x^{T}_{i}x_{j} + \sum \alpha_{i}$$

$$= \sum \alpha_{i} - \frac{1}{2}\sum\sum \alpha_{i}\alpha_{j}y_{i}y_{j}x^{T}_{i}x_{j}$$

Now the problem boils down to

$$\max_{\alpha} \left\{ J(\alpha) = \sum \alpha_{i} - \frac{1}{2}\sum\sum \alpha_{i}\alpha_{j}y_{i}y_{j}x^{T}_{i}x_{j} \right\},$$

s.t.
$$\alpha_{i} \geq 0, \ \forall i = 1, 2, ..., n, \qquad \sum\alpha_{i}y_{i} = 0$$

We can turn this maximization problem into a minimization problem by minimizing the negative of JJ as follows

$$\min_{\alpha} \left\{ J(\alpha) = \frac{1}{2}\sum\sum \alpha_{i}\alpha_{j}y_{i}y_{j}x^{T}_{i}x_{j} - \sum \alpha_{i} \right\}$$

Conclusion

We can obtain a solution to the problem above using the gradient descent method or quadratic programming.

The term $\sum\sum \alpha_{i}\alpha_{j}y_{i}y_{j}x^{T}_{i}x_{j}$ can be represented in matrix notation as follows.

$$\sum\sum \alpha_{i}\alpha_{j}y_{i}y_{j}x^{T}_{i}x_{j} = \alpha^{T}H\alpha, \qquad \text{where } H = yy^{T} \odot XX^{T}$$

$\odot$ is the element-wise product.

$$\min_{\alpha}\left( J(\alpha) = \frac{1}{2}\alpha^{T}H\alpha - I^{T}\alpha \right)$$

where $I$ is a vector of ones, so that $I^{T}\alpha = \sum \alpha_{i}$.

$$\frac{\partial J}{\partial \alpha} = H\alpha - I$$
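A minimal sketch of such a gradient-based solver, on made-up toy data: it builds $H$, takes gradient steps on the minimization form above, and projects onto $\alpha_{i} \geq 0$. For simplicity it drops the equality constraint $\sum\alpha_{i}y_{i} = 0$ (as when $b$ is absorbed into $w$, per step 1 of the margin formulation), so this is illustrative rather than a full solver.

```python
import numpy as np

# Made-up, linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [4.5, 3.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

H = np.outer(y, y) * (X @ X.T)        # H = yy^T (element-wise) XX^T
n = len(y)
alpha = np.zeros(n)
lr = 0.01

for _ in range(10000):
    grad = H @ alpha - np.ones(n)     # dJ/dalpha = H.alpha - I
    alpha -= lr * grad                # gradient step on the minimization form
    alpha = np.clip(alpha, 0.0, None) # project onto alpha_i >= 0

w = (alpha * y) @ X                   # recover w = sum alpha_i y_i x_i
print("alpha =", alpha)
print("w =", w)
```

Projecting back onto the feasible set after each step (here just clipping negative $\alpha_{i}$ to zero) is the standard trick for keeping gradient descent inside inequality constraints.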
