Tuesday 29 May 2018

Linear Regression - Part(1) [Machine Learning Perspective]

Motivation:

Suppose we want to predict the rent price of a room.  To make a prediction, we need some features or variables to base it on.  Thinking about it for a while, we might guess that rooms near a road cost more.  Similarly, rooms that have shutters are more expensive, since such rooms are valuable from a commercial perspective.  We also see that big rooms cost more than small ones.  And so on...

We want to take a handful of those features and, based on them, make a prediction of the rent price.  How could we do it?  How could we model it?  One technique for building such a rent price model is linear regression, and that is what we will discuss here.

What is linear regression?

Linear regression is a modeling method in which the dependent variable (the one to be predicted) is given by a linear combination of the independent variables plus some constant.  The model assumes that the dependent variable has a linear relation to the independent variables.  In machine learning terminology, the constant is called the bias and the coefficients that multiply the independent variables are called weights.
If the independent variables are given by a row vector $ \vec {x} $ and the dependent variable by $y$, the mathematical relation is:
$$ y = \sum _{i=1}^{n}{{w}_{i}{x}_{i}}+c \quad ... \quad (1)$$
where $w$'s are the weights and $c$ is the bias.
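
To make this concrete, here is what equation $(1)$ looks like in NumPy.  The features and all the numbers chosen for $x$, $w$ and $c$ are made up purely for illustration:

```python
import numpy as np

# Equation (1): y = sum_i w_i * x_i + c.
# The features and every number below are made up for illustration only.
x = np.array([12.0, 1.0, 0.0])      # e.g. room area, near a road (1/0), has a shutter (1/0)
w = np.array([50.0, 200.0, 300.0])  # one weight per feature
c = 1000.0                          # bias

y = np.dot(w, x) + c                # predicted rent price
print(y)                            # 1800.0
```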

[Picture of a linear regression model.  Credit: Wikipedia]

How is linear regression achieved using machine learning?

Linear regression can be carried out in a variety of ways, and one of them is machine learning.  Basically, in the machine learning formulation of linear regression, a cost function is decreased iteratively.  But for this, we should understand the cost function first.  In the case of linear regression, the cost function is the mean of the squared errors between the true values and the predicted values, divided by 2 for convenience.
Suppose we have $m$ points and the points are in $n$ dimensions.  The $i$-th point is given by the row vector $${x}^{(i)} = ({x}_{1}^{i}\quad{x}_{2}^{i}\quad...\quad{x}_{j}^{i}\quad...\quad{x}_{n}^{i})$$
Then, the cost function is given as:
$$ J= \cfrac{1}{2m}\sum_{i=1}^{m}{\left({    \sum _{j=1}^{n}{{w}_{j}{x}_{j}^{i}}+c-{y}^{i}}\right)}^{2}$$

We reduce the cost iteratively using the gradient descent algorithm, and the iterative process optimizes the values of the weights and bias so that the cost is minimized.  The resulting weights and bias give the linear regression model for the problem.

If we want to have L2 regularization, we can write the cost function as:
$$ J= \cfrac{1}{2m}\sum_{i=1}^{m}{\left({    \sum _{j=1}^{n}{{w}_{j}{x}_{j}^{i}}+c-{y}^{i}}\right)}^{2}+\cfrac{\lambda}{2m}\sum _{j=1}^{n}{{w}_{j}^{2}}$$
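
Here is a small sketch of this cost in Python, written with explicit loops so that it mirrors the two summations (we will vectorize it later).  The argument names are my own choices for the example, and setting $\lambda = 0$ gives back the unregularized cost:

```python
import numpy as np

# Cost J with explicit loops, mirroring the summations above.
# Assumed names: X is an (m, n) array whose rows are the points, y is an (m,)
# array of true values, w is an (n,) weight vector and c is the bias (a scalar).
def cost(X, y, w, c, lam=0.0):
    m, n = X.shape
    total = 0.0
    for i in range(m):
        prediction = sum(w[j] * X[i, j] for j in range(n)) + c
        total += (prediction - y[i]) ** 2
    reg = lam * sum(w[j] ** 2 for j in range(n))   # L2 regularization term
    return (total + reg) / (2 * m)
```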

But what is the gradient descent algorithm?

The gradient descent algorithm is an iterative algorithm in which the cost function is decreased towards a local minimum by updating the weights and bias at each iteration.  The weights and bias are updated by subtracting some fraction (the learning rate $\alpha$) of the gradient of the cost with respect to each of them.  The update formulas are:
For each ${w}_{j}$,
$$ {w}_{j}={w}_{j}-\alpha \cfrac{\partial J}{\partial {w}_{j}}$$
And for bias $c$,
$$c=c-\alpha\cfrac{\partial J}{\partial c}$$
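
To get a feel for the update rule before applying it to regression, here is a toy run of gradient descent on a simple hand-picked function (not our rent problem yet) whose gradients are easy to write down:

```python
# Toy example: minimize J(w, c) = (w - 3)^2 + (c + 1)^2, whose gradients are
# dJ/dw = 2*(w - 3) and dJ/dc = 2*(c + 1); the minimum is at w = 3, c = -1.
alpha = 0.1          # learning rate
w, c = 0.0, 0.0      # arbitrary starting values

for _ in range(100):
    grad_w = 2 * (w - 3)
    grad_c = 2 * (c + 1)
    w = w - alpha * grad_w
    c = c - alpha * grad_c

print(w, c)          # very close to 3 and -1 after 100 iterations
```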


How does gradient descent act on our problem?

Now, let's see how gradient descent acts on our problem.  First, let's calculate the gradients; then it is straightforward to plug them into the gradient descent algorithm.  I calculated the gradients of the regularized cost, and they come out as:
For each ${w}_{k}$,
$$  \cfrac{\partial J} {\partial {w}_{k}} = \cfrac{1}{m}\sum_{i=1}^{m}{\left({    \sum _{j=1}^{n}{{w}_{j}{x}_{j}^{i}}+c-{y}^{i}}\right)}{x}_{k}^{i}+\cfrac{\lambda {w}_{k}}{m} $$
For bias $c$,
$$  \cfrac{\partial J} {\partial c} = \cfrac{1}{m}\sum_{i=1}^{m}{\left({    \sum _{j=1}^{n}{{w}_{j}{x}_{j}^{i}}+c-{y}^{i}}\right)} $$
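
A loop-based sketch of these gradients, using the same assumed argument names as the cost sketch above, could look like this:

```python
import numpy as np

# Gradients of the regularized cost, with the same assumed names as cost() above.
def gradients(X, y, w, c, lam=0.0):
    m, n = X.shape
    grad_w = np.zeros(n)
    grad_c = 0.0
    for i in range(m):
        error = sum(w[j] * X[i, j] for j in range(n)) + c - y[i]
        for k in range(n):
            grad_w[k] += error * X[i, k]
        grad_c += error
    grad_w = grad_w / m + lam * w / m   # the second term comes from L2 regularization
    grad_c = grad_c / m
    return grad_w, grad_c
```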

The iterations are carried out until the weights and bias converge to some value.  If they are diverging, the learning rate is probably too high and we should decrease it.  There are also more advanced optimization techniques, for example ones where the learning rate decays over the iterations; they usually result in better training, and I hope to cover them in upcoming articles.

Code Vectorization

We don't write machine learning code with numerous loops; rather, we vectorize it and then run it.  Loopy code would be very inefficient, so we make use of libraries suited for vectorized implementations, such as NumPy.  We write the data, weights, bias, and everything else in matrix form and then perform matrix operations.  Libraries and frameworks built for computationally efficient matrix operations come to the rescue here.

Let's vectorize all of our equations.
The cost function would be:
$$ J = \cfrac{\left(Xw+c-Y\right)^{T}\left(Xw+c-Y\right)}{2m} + \cfrac{\lambda}{2m}{w}^{T}w$$
where $X$ is an $m\times n$ matrix, $w$ is an $n\times 1$ vector, $Y$ is an $m\times 1$ vector, and $c$ is a constant that gets added to every entry of $Xw$.
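
In NumPy, assuming the shapes given above, the vectorized cost takes only a couple of lines (the names are again just for the example):

```python
import numpy as np

# Vectorized cost; X is (m, n), Y is (m,), w is (n,), and c is a scalar that
# NumPy broadcasts across the residual vector Xw + c - Y.
def cost_vec(X, Y, w, c, lam=0.0):
    m = X.shape[0]
    residual = X @ w + c - Y
    return (residual @ residual) / (2 * m) + lam * (w @ w) / (2 * m)
```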

And the vectorized gradients would be:
$$ \cfrac{\partial J}{\partial w} = \cfrac{{X}^{T}\left[ Xw+c-Y\right]}{m}+\cfrac{\lambda}{m}w$$
$$ \cfrac{\partial J}{\partial c} = \cfrac{sum(Xw+c-Y)}{m}$$
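
And, continuing the same sketch, the vectorized gradients in code:

```python
# Vectorized gradients; the residual Xw + c - Y is computed once and reused.
def gradients_vec(X, Y, w, c, lam=0.0):
    m = X.shape[0]
    residual = X @ w + c - Y
    grad_w = X.T @ residual / m + lam * w / m
    grad_c = residual.sum() / m
    return grad_w, grad_c
```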

Plugging these gradients into the gradient descent equations, the training updates are:
$$ w=w-\alpha \left( \cfrac{{X}^{T}\left[ Xw+c-Y\right]}{m}+\cfrac{\lambda}{m}w \right)$$
And for bias $c$,
$$c=c-\alpha \left( \cfrac{sum(Xw+c-Y)}{m} \right)$$
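
Putting it all together, a minimal batch gradient descent training loop might look like the following.  The data here is synthetic, generated from known weights only so that we can check the loop recovers them:

```python
import numpy as np

# Batch gradient descent for linear regression on synthetic data.
rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
true_w, true_c = np.array([2.0, -1.0, 0.5]), 4.0
Y = X @ true_w + true_c + 0.01 * rng.normal(size=m)   # targets with a little noise

w = np.zeros(n)            # initial weights
c = 0.0                    # initial bias
alpha, lam = 0.1, 0.0      # learning rate and regularization strength

for _ in range(2000):
    residual = X @ w + c - Y
    w -= alpha * (X.T @ residual / m + lam * w / m)
    c -= alpha * residual.sum() / m

print(w, c)                # should be close to true_w and true_c
```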

What we are using here is batch gradient descent.  There are other methods too, like stochastic gradient descent, mini-batch gradient descent, and so on, which we will discuss in future articles.

The training is done until the weights and bias converge to some value, and the result is the linear regression model shown in $(1)$.

That is it for today.  We will have more articles looking at linear regression from other mathematical angles too.
Thank you!



