We have tabular data consisting of n observations; each observation represents a row in the table. Column-wise, the data consists of k+1 columns: k of them are independent variables, or "features", x1, x2, …, xk, and the last one is a "response" variable y whose values are continuous. The following figure shows an example of how the data looks.
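As a minimal sketch of this layout, the snippet below builds a hypothetical table with NumPy; the sizes (n = 5, k = 3) and the random values are made up purely for illustration.

```python
import numpy as np

# Hypothetical example: n = 5 observations, k = 3 features, plus one
# continuous response column y. Each row of `table` is one observation.
rng = np.random.default_rng(0)
n, k = 5, 3
X = rng.normal(size=(n, k))      # feature columns x1, x2, x3
y = rng.normal(size=n)           # continuous response column

table = np.column_stack([X, y])  # n rows, k + 1 columns
print(table.shape)               # (5, 4)
```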
The goal is to use the data at hand to build a model that approximates the relationship between the dependent variable y and the set of independent variables X = {x1, x2, x3, …, xk}.
Multiple linear regression, as an extension of simple linear regression, assumes a linear relationship between the dependent variable y and the set of independent variables X. In light of this assumption, y can be written as a linear combination of the xi's, where i = 1, 2, 3, …, k, plus an error term ε: y = β0 + β1x1 + β2x2 + β3x3 + … + βkxk + ε
ε is called the irreducible error, because no matter how close our model comes to the true relationship, there will still be a small margin of error.
y = Xβ + ε
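To make the data-generating assumption y = Xβ + ε concrete, here is a small simulation sketch; the coefficient values, sample size, and noise scale are all assumptions chosen for the example, and a column of ones is prepended to X to play the role of the intercept β0.

```python
import numpy as np

# Simulate data from y = X @ beta + eps.
rng = np.random.default_rng(1)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k features
beta = np.array([1.0, 2.0, -0.5])    # beta_0, beta_1, beta_2 (made up)
eps = rng.normal(scale=0.1, size=n)  # irreducible error term

y = X @ beta + eps
print(y.shape)  # (100,)
```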
A fundamental difference between the simple and the multiple versions of linear regression is that simple linear regression assumes true linearity between y and x, modeling the relationship between one independent variable and one dependent variable as a straight line y = mx + c.
On the other hand, the linearity in multiple linear regression is only required in the coefficients βi, i = 1, 2, 3, …, k; the relationship between y and the xi's in the set X can be non-linear. For example, y = β0 + β1x + β2x² is a valid multiple linear regression model because it is linear in the βi's, while y = β0 + x^β1 is not. The following examples show valid and invalid variants of multiple linear regression models.
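The sketch below illustrates a valid variant: the feature x² is non-linear in x, yet the model stays linear in the coefficients, so ordinary least squares (here via NumPy's `lstsq`) still applies. The true coefficients (1, 2, -3) and the noise level are assumptions for the demo.

```python
import numpy as np

# Valid variant: y = b0 + b1*x + b2*x**2 is still a linear model,
# because it is linear in the coefficients; x**2 is just another column.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.01, size=50)

# Design matrix with a non-linear feature, fitted by least squares.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))  # close to the true (1, 2, -3)
```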
The irreducible error ε cannot be estimated. Therefore our approximated model ignores it, and we infer only the values of the βi's. Our model takes the following form: ŷ = β0 + β1x1 + β2x2 + β3x3 + … + βkxk
Let's say we have already calculated the βi's; we can then calculate ŷi for each observation in our data. We can write this approximation in matrix notation as follows: ŷ = Xβ
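The matrix form computes the fitted values for all observations in a single product. In the sketch below, both the design matrix (with a leading column of ones for the intercept) and the coefficient values are made-up numbers used only to show the mechanics.

```python
import numpy as np

# Given already-estimated coefficients beta_hat, the fitted values for
# every observation at once are y_hat = X @ beta_hat.
X = np.array([[1.0, 2.0,  3.0],
              [1.0, 0.5, -1.0],
              [1.0, 1.0,  4.0]])       # 3 observations: intercept + 2 features
beta_hat = np.array([0.5, 1.0, -2.0])  # beta_0, beta_1, beta_2 (assumed)

y_hat = X @ beta_hat
print(y_hat)  # [-3.5  3.  -6.5]
```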
We want our approximated model to be as close as possible to the true relationship that generated the data. Therefore we want the difference between the observed value of the dependent variable and its approximated value to go to zero: y − ŷ → 0.
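One common way to quantify "as close as possible" is through the residuals y − ŷ and their sum of squares, which least squares fitting minimizes. The observed and fitted values below are invented for illustration.

```python
import numpy as np

# Residuals measure how far the fitted values are from the observations;
# their sum of squares is the quantity least squares makes small.
y = np.array([3.0, -1.0, 2.0])      # observed responses (made up)
y_hat = np.array([2.5, -0.5, 2.0])  # fitted values (made up)

residuals = y - y_hat
sse = np.sum(residuals**2)
print(sse)  # 0.5
```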