Derivation of the closed-form solution of Multiple Linear Regression

Written by Ahmed Mirghani on 📅 July 10, 2023

📖 ~ 5 min read


Problem formulation

We have tabular data consisting of $n$ observations; each observation is a row in the table. Column-wise, the data has $k+1$ columns: $k$ of them are independent variables ("features") $x_1, x_2, \dots, x_k$, and the last one is a "response" variable $y$ whose values are continuous. The following figure shows an example of how the data looks.

Tabular Data

The goal is to use the data at hand to build a model that approximates the relationship between the dependent variable $y$ and the set of independent variables $X = \{x_1, x_2, x_3, \dots, x_k\}$.

Multiple linear regression, as an extension of simple linear regression, assumes a linear relationship between the dependent variable $y$ and the set of independent variables $X$. In light of this assumption, $y$ can be written as a linear combination of the $x_i$'s, where $i = 1, 2, 3, \dots, k$.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + \varepsilon$$

Stacking all $n$ observations gives the matrix form:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & x_{23} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

$\varepsilon$ is called the irreducible error, because no matter how close the model we build gets to the true relationship, a small margin of error will remain. Compactly:

$$y = XB + E$$
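To make the data-generating model $y = XB + E$ concrete, here is a minimal NumPy sketch that simulates such data. The coefficient values and noise scale are made up for illustration; the first column of ones carries the intercept $\beta_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 100, 3                                  # n observations, k features
features = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), features])    # prepend the intercept column of ones

B_true = np.array([2.0, -1.0, 0.5, 3.0])       # assumed beta_0 .. beta_k
E = rng.normal(scale=0.1, size=n)              # irreducible error

y = X @ B_true + E                             # y = XB + E
```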

A fundamental difference between the simple and multiple versions of linear regression: simple linear regression assumes true linearity between $y$ and $x$, modeling the relationship between one independent variable and one dependent variable as a straight line $y = mx + c$.
In multiple linear regression, on the other hand, linearity is only required in the coefficients $\beta_i$, $i = 1, 2, 3, \dots, k$; the relationship between $y$ and the $x_i$'s in the set $X$ may be non-linear. The following examples show valid and invalid variants of multiple linear regression models.

  • $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1^2$ ✅
  • $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3$ ✅
  • $y = \beta_0 + \beta_1^2 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$ ❌
  • $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_1 \beta_2 x_3 + \beta_4 x_3$ ❌

Matrix Notation

The irreducible error $\varepsilon$ cannot be estimated. Therefore our approximated model ignores it, and we infer only the values of the $\beta_i$'s. Our model takes the following form:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k$$

Suppose we have already calculated the $\beta_i$'s; we can then compute $\hat{y}_i$ for each observation in our data. This approximation can be written in matrix notation as follows.

$$\begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & x_{23} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}$$
$$X\hat{B} = \hat{y}$$
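In code, this matrix form is a single matrix-vector product. The numbers below are made up for illustration, with an already-estimated (assumed) coefficient vector:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])      # first column: intercept
B_hat = np.array([0.5, 1.0, -1.0])   # assumed, already-estimated coefficients

y_hat = X @ B_hat                    # fitted values, one per observation
print(y_hat)                         # [-0.5 -0.5 -0.5]
```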

Finding the optimal parameters

We want our approximated model to be as close as possible to the true relationship that generated the data. Therefore we want the difference between the observed value of the dependent variable and its approximated value to go to zero: $y - \hat{y} \rightarrow 0$.

$$y - \hat{y} = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix} \rightarrow \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

In other words, we want the overall sum of squared errors $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ to be as small as possible.
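This sum of squared errors is straightforward to compute, and the elementwise sum agrees with the equivalent vector form $(y - \hat{y})^T (y - \hat{y})$. A quick sanity check with made-up values:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])       # observed (made-up)
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # fitted (made-up)

r = y - y_hat
rss_sum = np.sum(r**2)   # elementwise: sum_i (y_i - y_hat_i)^2
rss_vec = r.T @ r        # vector form: (y - y_hat)^T (y - y_hat)

print(np.isclose(rss_sum, rss_vec))  # True
```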

Since $\hat{y} = X\hat{B}$,

we can express $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ as follows:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (y - \hat{y})^T (y - \hat{y})$$

$$(y - \hat{y})^T (y - \hat{y}) = (y - X\hat{B})^T (y - X\hat{B})$$

$$= \left(y^T - (X\hat{B})^T\right)(y - X\hat{B})$$

$$= \left(y^T - \hat{B}^T X^T\right)(y - X\hat{B})$$

$$= y^T y - y^T X \hat{B} - \hat{B}^T X^T y + \hat{B}^T X^T X \hat{B}$$

To find the $\hat{B}$ that minimizes this expression, we set its derivative with respect to $\hat{B}$ to zero.

$$\frac{\partial \left((y - X\hat{B})^T (y - X\hat{B})\right)}{\partial \hat{B}} = 0$$

$$\Rightarrow \quad \frac{\partial \left(y^T y - y^T X \hat{B} - \hat{B}^T X^T y + \hat{B}^T X^T X \hat{B}\right)}{\partial \hat{B}} = 0$$

$$\Rightarrow \quad \frac{\partial\, y^T y}{\partial \hat{B}} - \frac{\partial\, y^T X \hat{B}}{\partial \hat{B}} - \frac{\partial\, \hat{B}^T X^T y}{\partial \hat{B}} + \frac{\partial\, \hat{B}^T X^T X \hat{B}}{\partial \hat{B}} = 0$$

If $A$ is an $m \times m$ matrix and $a$, $x$ are $m \times 1$ vectors, then (writing the gradient of a scalar as a column vector):

$$\frac{\partial\, a^T x}{\partial x} = \frac{\partial\, x^T a}{\partial x} = a, \qquad \frac{\partial\, x^T A x}{\partial x} = (A + A^T)x = 2Ax \ \text{when } A \text{ is symmetric}$$
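These matrix-calculus identities can be verified numerically. The sketch below checks the quadratic-form gradient against a central finite difference for a random symmetric $A$ (for symmetric $A$, $2Ax = 2A^T x$, so the two ways of writing the identity agree):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
M = rng.normal(size=(m, m))
A = M + M.T                      # symmetric matrix
x = rng.normal(size=m)

analytic = 2 * A @ x             # gradient of x^T A x for symmetric A

# central finite differences, one coordinate at a time
h = 1e-6
numeric = np.empty(m)
for i in range(m):
    e = np.zeros(m)
    e[i] = h
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```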

Applying these identities with $a = X^T y$ (note that $y^T X \hat{B} = (X^T y)^T \hat{B}$) and $A = X^T X$, which is symmetric:

$$\frac{\partial\, y^T X \hat{B}}{\partial \hat{B}} = X^T y$$

$$\frac{\partial\, \hat{B}^T X^T y}{\partial \hat{B}} = X^T y$$

$$\frac{\partial\, \hat{B}^T X^T X \hat{B}}{\partial \hat{B}} = 2 X^T X \hat{B}$$

$$\frac{\partial\, y^T y}{\partial \hat{B}} = 0$$

$$\therefore \quad -X^T y - X^T y + 2 X^T X \hat{B} = 0$$

$$\Rightarrow \quad 2 X^T X \hat{B} = 2 X^T y$$

Assuming $X^T X$ is invertible (which holds when $X$ has full column rank),

$$\boxed{\hat{B} = (X^T X)^{-1} X^T y}$$
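The closed-form solution via the normal equations $X^T X \hat{B} = X^T y$ translates directly to NumPy. A sketch on simulated (made-up) data; in practice, solving the linear system is preferred over forming the explicit inverse, and `np.linalg.lstsq` is the numerically robust built-in:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
B_true = np.array([1.0, 2.0, -3.0, 0.5])         # assumed true coefficients
y = X @ B_true + rng.normal(scale=0.05, size=n)  # add irreducible noise

# closed form: solve the normal equations X^T X B = X^T y
B_hat = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check against NumPy's least-squares routine
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(B_hat, B_lstsq))  # True
```

With enough observations and small noise, `B_hat` recovers `B_true` closely; the residual gap comes from the irreducible error, exactly as the derivation predicts.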
