Monday, 2 July 2018

Linear Regression - Part(2) [Matrix Calculus]

Previously, I posted part (1) of linear regression, where we looked at it from the machine learning perspective.  I had promised to present it from other mathematical angles too, and here I am!
In this article, we will look at it through matrix calculus.

The link to previous article is given below.
http://avidsuraj.blogspot.com/2018/05/linear-algebra-part1-machine-learning.html

I have used the same notation as in the previous article, so readers are recommended to read that article first, or at least its notation section.

The cost function or error function is written as:
$$ J = \sum  _{i=1}^{m} {{{e}^{(i)}}^{2}}$$
where $e^{(i)}$ is the error for the ${i}^{th}$ training example.

Let $E$ represent the matrix of all the errors.  It is given by:
$$ E = Y-{Y}_{pred} \\
E = Y-(XW+C) \\
E = Y-XW-C $$
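
As a quick illustration, here is a minimal NumPy sketch of these quantities (the shapes, toy values, and variable names are my own assumptions, following the notation of part 1):

import numpy as np

# Toy shapes: m training examples, n features (assumed, per part 1's notation)
m, n = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))   # feature matrix
W = rng.normal(size=(n, 1))   # weight (column) vector
c = 0.5                       # bias term, broadcast over all examples
Y = rng.normal(size=(m, 1))   # target values

Y_pred = X @ W + c            # predictions
E = Y - Y_pred                # error vector: E = Y - XW - C
J = float(np.sum(E ** 2))     # cost: sum of squared errors
print(J)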
Differentiation of a vector $Y$ with respect to a vector $X$


Tuesday, 29 May 2018

Linear Regression - Part(1) [Machine Learning Perspective]

Motivation:

Suppose we want to predict the rent price of a room.  Prediction requires some features, or variables, on the basis of which we predict.  We think for a while: maybe rooms that are near the road cost more.  Similarly, rooms that have shutters are more expensive, since such rooms are valuable from a commercial perspective.  We also see that big rooms cost more than smaller ones.  And so on...

We want to take a handful of those features and, based on them, make a prediction of the rent price.  How could we do it?  How could we model it?  One technique for modeling such a rent-price predictor is linear regression, and we will discuss it here.

Sunday, 11 March 2018

Visualization Challenge on Simple Linear Algebra Problem

An $n \times n$ matrix multiplies an $n$-dimensional column vector to produce another $n$-dimensional column vector.  Here, we visualize this multiplication of a matrix and a vector to produce another vector, and then see how many combinations of either the vector or the matrix are possible.

Dimensionally, it is given as:
$$[n\times n]\times [n \times 1]=[n \times 1]$$

The task of visualization is:
Suppose you multiply an $n\times n$ matrix with an $n$-dimensional column vector to produce another column vector, which obviously has $n$ dimensions.
If your two vectors are known, you have an infinite number of choices for the matrix.  But if the matrix and the second vector are given, you have just one choice for the first vector, if any.
Try to visualize it and convince yourself geometrically.

Now, I will show how I did the visualization and convinced myself.*

Multiplying an $n \times n$ matrix with an $n$-dimensional column vector means taking a linear combination of the column vectors of the matrix, where the linear coefficients are given by the components of the left-hand-side column vector.  That linear combination of the matrix's column vectors gives the column vector on the right-hand side.
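
To make this concrete, here is a small NumPy sketch (the matrix and vector are my own toy values) checking that $Ax$ equals the combination of the columns of $A$ weighted by the components of $x$:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])               # the n x n matrix (n = 2 here)
x = np.array([4.0, -1.0])                # the left-hand-side column vector

b = A @ x                                # matrix-vector product
combo = x[0] * A[:, 0] + x[1] * A[:, 1]  # linear combination of the columns of A

print(b, combo)                          # both give [ 7. -3.]
print(np.allclose(b, combo))             # True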

If the two vectors are given, we have fixed linear coefficients and a fixed second column vector.  That means we should choose the matrix's column vectors in such a way that their linear combination, with the given coefficients, produces the right-hand-side column vector.  If we take two-dimensional space and try to visualize it, the intuition comes easily.  Given a vector (the RHS), we can produce it from an infinite number of combinations of two other vectors (with the linear coefficients given), which we can see using the parallelogram law of vector addition.  Given two points as opposite corners of a parallelogram, there is an infinite number of ways the other two opposite corners can be placed to form a parallelogram.

If the matrix and the second vector are given to us, let's see how we can visualize to convince ourselves that there is at most one choice of vector.  The problem translates to finding the linear coefficients such that the linear combination of the given matrix columns gives the right-hand-side column vector.  When the given $n$-dimensional column vectors are linearly independent, there is only one such set of linear coefficients, because each set of weights, used in the linear combination of the matrix columns, produces its own unique vector.  So there can be at most one combination of linear coefficients, and hence at most one vector.  If the matrix columns are not independent, there may be no choice at all (or, if the RHS vector happens to lie in their span, more than one).
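
And a quick numerical illustration of the uniqueness claim (again a sketch with my own toy values), using NumPy's linear solver:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])    # columns are linearly independent
b = np.array([7.0, -3.0])     # the given right-hand-side vector

x = np.linalg.solve(A, b)     # the unique set of linear coefficients
print(x)                      # [ 4. -1.]
print(np.allclose(A @ x, b))  # True: this x reproduces b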

*I would also like to add that the way people visualize and build intuition differs from individual to individual.  This is my attempt to explain how I did the visualization.  In my opinion, convincing yourself of why something is true matters more than how you convinced yourself.  Have fun!

Thursday, 1 March 2018

How Does ReLU Function Give Non-linearity?

The Rectified Linear Unit (ReLU) function is one of the activation functions used massively in neural networks.  ReLU and its variants have gained huge popularity over other activation functions such as sigmoid and tanh because of their advantages, which we will discuss in a separate article.

The ReLU function is defined as:
$$ ReLU(x)=\begin{cases} x & \text{if } x>0 \\ 0 & \text{if } x\le 0 \end{cases} $$
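
In code, the definition is a one-liner; here is a minimal NumPy sketch (the function name `relu` is my own):

import numpy as np

def relu(x):
    """Element-wise ReLU: returns x where x > 0, else 0."""
    return np.maximum(x, 0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]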

Its graph is drawn as:


The sigmoid function has a curved shape, but the ReLU function lacks that curvature.  So it is natural to wonder how ReLU achieves non-linearity.  I will try to give some intuition on it here.

The first intuition comes from looking at the shape of the ReLU function above.  A linear function forms a straight line, but ReLU is not a straight line; it is a piecewise-linear function with a kink at $x = 0$.  That already gives a little intuition about its non-linearity.

Let's delve into it further now.  Suppose there is data whose input-output relationship can be modeled with the absolute value function.  We apply a neural network to it and use ReLU as the activation.  Training then selects the weights and biases in such a way that the neural network mimics the absolute value function.

I will take a two-layer neural network and manually choose the weights and biases so that the network mimics the absolute value function.  I have one input ($x$), one output ($y$), and a single hidden layer with two nodes.  So the neural network model becomes:


The output ($y$) then becomes:
$$ y=max({ w }_{ 1 }^{ [2] }max\left( { w }_{ 1 }^{ [1] }x+{ b }_{ 1 },0 \right) +{ w }_{ 2 }^{ [2] }max\left( { w }_{ 2 }^{ [1] }x+{ b }_{ 1 },0 \right) +{ b }_{ 2 },0) $$

I manually set the weights and biases as:
$$ { w }_{ 1 }^{ [2] }={ w }_{ 2 }^{ [2] }=1,\quad { w }_{ 1 }^{ [1] }=1,{ w }_{ 2 }^{ [1] }=-1,\quad { b }_{ 1 }=0,{ b }_{ 2 }=0 $$

So, the equation becomes:
$$ y=max(max\left( x,0 \right) +max\left( -x,0 \right) ,0) $$

For $ x \ge 0 $, we have:
$$ y=max(max\left( x,0 \right) +max\left( -x,0 \right) ,0)\\ y=max(x,0)=x $$

And, for $x<0$, we get:
$$ y=max(max\left( x,0 \right) +max\left( -x,0 \right) ,0)\\ y=max(-x,0)\\ y=-x $$

So the overall network computes the absolute value function: with the parameters I chose, the neural network reproduces $y = |x|$.
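
To double-check numerically, here is a minimal sketch of the two-layer network with the parameters chosen above (assuming NumPy; the function names are mine):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

def two_layer_net(x):
    # Hidden layer: weights 1 and -1, biases 0; output layer: both weights 1, bias 0
    h1 = relu(1.0 * x + 0.0)
    h2 = relu(-1.0 * x + 0.0)
    return relu(1.0 * h1 + 1.0 * h2 + 0.0)

xs = np.linspace(-3, 3, 7)
print(two_layer_net(xs))   # [3. 2. 1. 0. 1. 2. 3.]
print(np.abs(xs))          # matches |x| exactly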

With hand-picked parameters, I was able to make the neural network model the absolute value function.  Similarly, we can model more complex functions by using bigger neural networks with appropriate parameters.
Hopefully, this gives you more intuition about the non-linearity of ReLU-based networks.

Now, let's try to model a classification problem using the non-linearity of ReLU.  Suppose the points inside a unit circle belong to one class and those outside the unit circle belong to another class.  Let's see if we can model the boundary using ReLU activations.

Let 0 represent the class of points inside the circle and 1 the class of points outside the circle.  I will try to approximate the circular boundary using four straight lines, which we will show can be formed by a ReLU-activated neural network.

The exact boundary and approximated boundary are shown below.

Now, let's try to model the approximate boundary using a neural network.
Remember that points inside the boundary are labeled 0 and points outside the circle are labeled 1.

I choose the network in such a way that it implements this approximate boundary as its classification rule.

Here, I have four nodes in the hidden layer, two inputs, x and y, and then an output classifying 0 or 1.

The output(O) can be constructed as:
$$ O=(max(x+y-1,0)+max(-x+y-1,0)+max(-x-y-1,0)+max(x-y-1,0))>0 $$
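
As a quick sanity check of this expression, here is a short sketch (my own code, directly evaluating the formula above) for one point inside and one point outside the unit circle:

def classify(x, y):
    """Approximate 'outside the unit circle' classifier built from four ReLU units."""
    h = (max(x + y - 1, 0) + max(-x + y - 1, 0)
         + max(-x - y - 1, 0) + max(x - y - 1, 0))
    return int(h > 0)   # final layer: threshold (relational operator)

print(classify(0.2, 0.3))   # 0 -> inside the approximate boundary
print(classify(1.5, 1.5))   # 1 -> outside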

We find $O$ from $x$ and $y$ using the neural network, with the final output layer acting as a relational (threshold) operator.

The neural network is constructed as shown in the figure below.



The neural network constructed above classifies the points of the specified problem using the approximate boundary.  The more nodes we put in the hidden layer, the closer the approximate boundary gets to the exact boundary.  So, to construct a more complex boundary, we use more neurons.  But there is always a chance of overfitting to the training set, so we should build the model while testing it on a validation set.

Hopefully, this gave some intuition about the non-linearity of ReLU activations.  Any feedback, suggestions, and comments are most welcome.  I hope to bring more articles in the coming days on machine learning, artificial intelligence, mathematics, and computer science.

Thank you.

Tuesday, 23 January 2018

Sigmoid Function and Softmax Function, Are They Related?

The sigmoid function is an activation function that gives values in the range 0 to 1.  Inputs much less than 0 give an activation close to 0, and inputs much greater than 0 give an activation close to 1.
It can be expressed mathematically as:
$$ \sigma (x) = \cfrac {1} {1+{e}^{-x}}$$

The sigmoid function is used as the final activation when we need to decide whether something is true.  It gives the probability that something is true or false, as we define the truth.  For example, if we want to determine whether a given picture is a cat, we may use a sigmoid as the final activation function.  But for a multi-class problem, should we use a sigmoid for every class, or is there some other function?

The softmax function is widely used for multi-class classification problems.  But how does it compare with a sigmoid applied to each class?

Block diagram of logistic regression


It turns out that applying a sigmoid to each class is the same as applying a softmax.  It does not matter which of the two activations we use; we end up doing the same thing, and training with either function is mathematically identical.  Let's see how, for the binary (single-class) classification problem.

We use a sigmoid function whose output is the probability that some picture is a cat.  That can be viewed as a two-class classification problem with the labels:
1) Being a cat
2) Not being a cat

So, using a sigmoid as the final activation function, we get the equivalent two-class final activation:
$$ \left( {a}_{1} \quad {a}_{2}\right)=\left( \cfrac {1} {1 + {e}^{-{z}_{1}}} \quad \cfrac {{e}^{-{z}_{1}}} {1+{e}^{-{z}_{1}}} \right)\quad \dots \quad (1)$$
Note that we got ${a}_{2}$ by subtracting ${a}_{1}$ from $1$.

That is the same as applying a sigmoid to each class's output.  In that case we would get:
$$ \left( { a }_{ 1 }\quad { a }_{ 2 } \right) =\left( \cfrac { 1 }{ 1+{ e }^{ -{ z }_{ 1 } } } \quad \cfrac { 1 }{ 1+{ e }^{ -{ z }_{ 2 } } }  \right)  $$

Equating the two expressions for ${a}_{2}$ (note that $\cfrac {{e}^{-{z}_{1}}} {1+{e}^{-{z}_{1}}} = \cfrac {1} {1+{e}^{{z}_{1}}}$) and simplifying, we get:
$$ {z}_{1}=-{z}_{2}\quad\dots\quad(2)$$

Putting $(2)$ into $(1)$,
$$\left( {a}_{1} \quad {a}_{2}\right)=\left( \cfrac {1} {1 + {e}^{-{z}_{1}}} \quad \cfrac {{e}^{{z}_{2}}} {1+{e}^{{z}_{2}}} \right)$$

Now, multiplying the numerator and denominator of ${a}_{1}$ by ${e}^{\frac{{z}_1} {2}}$ and those of ${a}_{2}$ by ${e}^{-\frac{{z}_2} {2}}$, we have:
$$\left( {a}_{1} \quad {a}_{2}\right)=\left( \cfrac {{e}^{\frac{{z}_1} {2}}} {{e}^{\frac{{z}_1} {2}} + {e}^{\frac{-{z}_{1}} {2}}} \quad \cfrac {{e}^{\frac{{z}_{2}}{2}}} {{e}^{-\frac{{z}_{2}}{2}}+{e}^{\frac{{z}_{2}}{2}}} \right)\quad\dots\quad(3)$$

Using $(2)$ in $(3)$,
$$\left( {a}_{1} \quad {a}_{2}\right)=\left( \cfrac {{e}^{\frac{{z}_1} {2}}} {{e}^{\frac{{z}_1} {2}} + {e}^{\frac{{z}_{2}} {2}}} \quad \cfrac {{e}^{\frac{{z}_{2}}{2}}} {{e}^{\frac{{z}_{1}}{2}}+{e}^{\frac{{z}_{2}}{2}}} \right)\quad\dots\quad(4)$$

$(4)$ is the same as using a softmax as the final activation, where the $z$ values are $\frac{{z}_{1}}{2}$ and $\frac{{z}_{2}}{2}$.  This is no different from using a softmax on the $z$ values ${z}_{1}$ and ${z}_{2}$, since the two only differ by a scaling of the $z$ values, which the preceding linear layer can absorb during training.
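
A quick numerical check of this equivalence (a sketch using NumPy; the helper functions are my own):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z1 = 1.7
z2 = -z1                        # the constraint z1 = -z2 derived in (2)

a_sigmoid = np.array([sigmoid(z1), 1.0 - sigmoid(z1)])  # two-class sigmoid output, as in (1)
a_softmax = softmax(np.array([z1 / 2, z2 / 2]))         # softmax on the halved z values, as in (4)

print(a_sigmoid, a_softmax)
print(np.allclose(a_sigmoid, a_softmax))   # True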

So, training with a sigmoid as the final activation for single-class (binary) classification is just a special case of training with a softmax over two classes.  We can argue similarly for classification with more than two classes.