Would you read this article, or not?
So, now that I have grabbed your attention: this article is about logistic regression. We’ll talk about topics such as the hypothesis function, decision boundaries, non-linear decision boundaries, simplified gradient descent, and advanced optimization.
Logistic regression is one of the algorithms used for classification. It is an ideal algorithm when the dependent variable is binary, i.e. when we have to choose between two values, such as whether you will perform a task x or not. Some other examples are:
- Email: Spam / Ham?
- Online transaction fraud: Yes / No?
- Tumor type: Malignant / Benign?
- Will you read this article: Yes / No? And many more…
These types of classification are known as binary classification, since there are only two values to choose from (yes or no, 0 or 1), and logistic regression is typically trained with the gradient descent algorithm to locate the optimum of its cost function.
If y is our output variable, we write y ∈ {0, 1}, where 0 denotes the negative class and 1 the positive class. (y can also take multiple values, y ∈ {0, 1, …, n}, the multi-class case covered at the end.)
Linear regression does not perform well at classifying such cases, because its output is not confined to the two class labels. We need a hypothesis whose output always lies between 0 and 1:
0 ≤ h(x) ≤ 1
Hypothesis function:
h(x) = g(θᵀx)
where g is the “sigmoid function”. You must be wondering: what is the sigmoid function and how does it work?
The sigmoid function is also called the logistic function.
You can see that for values of z > 0, the graph shows predictions with probability 0.5 and above, which is exactly the thresholding behavior we want; this makes logistic regression a good fit for classification. The sigmoid function is defined as:
g(z) = 1 / (1 + e^(-z))
You can plot the function yourself to see the S-shaped curve that the equation above describes, as in the sketch below.
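Here is a minimal sketch (assuming NumPy and Matplotlib are available) that computes and plots the sigmoid:

```python
# A minimal sketch: compute and plot g(z) = 1 / (1 + e^(-z)).
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle="--", color="gray")  # g(0) = 0.5, the decision threshold
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("Sigmoid (logistic) function")
plt.show()
```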
Let h(x) = the estimated probability that y = 1 for input x. Example: the tumor classification problem.
x = [x0; x1] (vector) = [1; tumorSize] (vector)
h(x) = 0.7 (say), i.e. there is a 70% chance that the tumor is malignant.
h(x) = P(y=1 | x; θ) // the probability that y = 1 given x, parametrized by θ
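To make this concrete, here is a small sketch of computing h(x) for the tumor example; the θ values and tumor size are made up purely for illustration:

```python
# Hypothetical parameters: theta here is invented for illustration only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 0.08])   # made-up parameters
x = np.array([1.0, 48.0])        # [1, tumorSize]
h = sigmoid(theta @ x)           # h(x) = g(theta^T x) = P(y=1 | x; theta)
print(f"Estimated probability the tumor is malignant: {h:.2f}")  # ~0.70
```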
Decision Boundary
The decision boundary is the surface in the problem space that separates the region where the hypothesis predicts y = 1 from the region where it predicts y = 0.
The figure above shows a linear decision boundary; the one below shows a non-linear decision boundary.
This non-linear decision boundary is a circle, i.e. (x-h)² + (y-k)² = r².
Note: the decision boundary is a property of the hypothesis function and its parameters, not of the training set.
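As a sketch of how the parameters determine the boundary (the θ values below are invented to produce a unit circle): we predict y = 1 exactly when θᵀx ≥ 0, i.e. when g(θᵀx) ≥ 0.5.

```python
# Sketch: a fitted theta defines the boundary; predict y=1 when theta^T x >= 0.
import numpy as np

def predict(theta, features):
    # g(theta^T x) >= 0.5 is equivalent to theta^T x >= 0
    return (features @ theta >= 0).astype(int)

# Non-linear boundary: with features [1, x1, x2, x1^2, x2^2] and
# theta = [-1, 0, 0, 1, 1], the boundary is the circle x1^2 + x2^2 = 1.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
points = np.array([[0.2, 0.2], [1.5, 0.5]])
features = np.column_stack([
    np.ones(len(points)), points[:, 0], points[:, 1],
    points[:, 0] ** 2, points[:, 1] ** 2,
])
print(predict(theta, features))  # [0 1]: first point inside the circle, second outside
```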
Why can’t we apply linear regression?
We need a cost function that gradient descent can drive to a global minimum. Plugging the sigmoid hypothesis into linear regression’s squared-error cost yields a non-convex function with many local optima, whereas the logistic cost function below is convex.
Logistic Regression Cost Function:
So our cost function now looks something like this:
Cost(h(x), y) = -log(h(x)) if y = 1
Cost(h(x), y) = -log(1-h(x)) if y = 0
Let’s look at the curves of -log(h(x)) and -log(1-h(x)) individually.
cost = 0 if y = 1 and h(x) = 1,
but as h(x) → 0, cost → ∞.
This captures the intuition that if h(x) = 0 (we predict P(y=1 | x; θ) = 0) but in fact y = 1, the learning algorithm is penalized by a very large cost, as the plot below shows.
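A small sketch plotting both halves of the cost makes the penalty visible:

```python
# Plot the two halves of the cost: -log(h) explodes as h -> 0 (bad miss when y=1),
# and -log(1 - h) explodes as h -> 1 (bad miss when y=0).
import numpy as np
import matplotlib.pyplot as plt

h = np.linspace(0.001, 0.999, 500)
plt.plot(h, -np.log(h), label="y = 1: -log(h(x))")
plt.plot(h, -np.log(1 - h), label="y = 0: -log(1 - h(x))")
plt.xlabel("h(x)")
plt.ylabel("cost")
plt.legend()
plt.show()
```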
Simplified Gradient Descent
Cost(h(x), y) = -y * log(h(x)) - (1-y) * log(1-h(x))
Let’s verify that this equation is equivalent to the piecewise version above.
Substitute y = 1: cost = -log(h(x)).
Substitute y = 0: cost = -log(1-h(x)).
Now the full cost function over all m training examples becomes:
J(θ) = -(1/m) * ∑(i=1 to m) [ y(i) * log(h(x(i))) + (1-y(i)) * log(1-h(x(i))) ]
You can also derive this cost function from the principle of maximum likelihood estimation.
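A minimal vectorized sketch of J(θ), assuming X is the m × (n+1) design matrix with a leading column of ones and y is the vector of 0/1 labels:

```python
# Vectorized logistic cost J(theta) over the whole training set.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)  # predictions for all m examples at once
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```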
Optimization Function:
Repeat { θj := θj - α * (∂/∂θj)J(θ) }, where (∂/∂θj)J(θ) = (1/m) * ∑(i=1 to m) (h(x(i)) - y(i)) * xj(i), updating all θj simultaneously.
This update rule looks identical to the one used in linear regression, but the hypothesis function h(x) differs in logistic regression, so the two algorithms are not the same.
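Putting the update rule in code, here is a sketch of batch gradient descent; the learning rate and iteration count are arbitrary and would need tuning in practice:

```python
# Batch gradient descent for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m  # (1/m) * sum (h(x(i)) - y(i)) * x(i)
        theta -= alpha * grad                      # simultaneous update of every theta_j
    return theta
```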
Advanced optimization techniques:
There are many optimization algorithms other than gradient descent to choose from. Some of them are:
Conjugate Gradient, BFGS, and L-BFGS.
Pros: no need to manually pick α (the learning rate); often faster than gradient descent; works well on huge data sets.
Cons: more complex to use.
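As a sketch, here is how you might hand the same cost and gradient to SciPy’s L-BFGS implementation instead of writing the descent loop yourself; the tiny data set is made up for illustration:

```python
# Let scipy's L-BFGS-B minimize J(theta); no learning rate to pick manually.
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return J, X.T @ (h - y) / m

# Tiny made-up data set: intercept column plus one feature.
X = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.], [1., 6.]])
y = np.array([0., 0., 1., 0., 1., 1.])

res = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, y),
               jac=True, method="L-BFGS-B")
print(res.x)  # fitted theta
```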
Extra
Multi-Class classification : One vs all
Multi-class classification techniques are used when the output variable can take more than two values, i.e. when each example must be assigned to one of several classes. For example:
Email foldering/tagging: work, family, friends, acquaintances, etc.
Medical diagnosis: not ill, cold, flu.
Weather: rainy, cloudy, snowy.
One-vs-all classification, sometimes called one-vs-rest, reduces a multi-class problem to several binary ones: train a separate logistic regression classifier for each class, treating that class as 1 and every other class as 0, then predict the class whose classifier outputs the highest probability, as sketched below.
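A sketch of the idea; fit_logistic stands in for any binary trainer, e.g. the gradient descent sketch above:

```python
# One-vs-all: one binary classifier per class, predict the most confident one.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all(X, y, num_classes, fit_logistic):
    # Relabel each class k as a 0/1 problem and train one theta per class.
    return [fit_logistic(X, (y == k).astype(float)) for k in range(num_classes)]

def predict(thetas, X):
    probs = np.column_stack([sigmoid(X @ t) for t in thetas])
    return probs.argmax(axis=1)  # index of the classifier with highest probability
```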
Huh… congratulations! Now you know logistic regression. Thanks for bearing with me till the end. Don’t forget to clap, for this article and for yourself :)