Classification - explained
posted on 11 Jun 2020 under category tutorial
In this post, we will study classification algorithms. First, we will understand what classification is; then we will look at the difference between classification and regression, and at how classification algorithms are evaluated. We will explore the individual classification algorithms one by one in the next blogs.
Regression and classification are both forms of prediction from labelled data, i.e. supervised learning. Regression predicts a continuous numerical value, whereas classification predicts which class an example belongs to.
So, the idea behind classification algorithms is pretty simple: you predict the target class by analyzing the training dataset. This is one of the most essential concepts you need to study when you learn data science.
What is Classification? We use the training dataset to derive boundary conditions that can be used to determine each target class. Once the boundary conditions are determined, the next task is to predict the target class of new examples. The whole process is known as classification.
Examples:
Let me explain this with the help of an example. Suppose we are predicting a score ranging between 0 and 1 based on some input features. We can build a classification on top of this regression: if the score is greater than 0.5, we say the class is ‘a’; otherwise the class is ‘b’. This is how we can turn a regression into a classification.
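The thresholding step above can be sketched in a few lines of Python. The 0.5 cut-off and the class names ‘a’/‘b’ come from the example; the function name is just illustrative:

```python
# Turn a regression score in [0, 1] into a class label, as in the example above.
def score_to_class(score, threshold=0.5):
    """Return class 'a' if the score exceeds the threshold, else class 'b'."""
    return "a" if score > threshold else "b"

print(score_to_class(0.73))  # -> a
print(score_to_class(0.21))  # -> b
```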
Usually, though, the difference between regression and classification lies in the loss function. In both, the goal of the optimization algorithm is to minimize the output of the loss function.
In regression, the loss increases the further you are from the correct value for a training instance. In classification, the loss typically only increases if you predict the wrong class, and is zero for the right class. The effect is that classification only “cares” whether it gets the right class, and there is no sense of how “close” it was (or how confident it was, which might be another way to think about it), whereas regression’s “goal” is to predict the training values as closely as possible.
Let us use another example: house price prediction. We can use regression to predict the price of a house based on its ‘size’ (sq. feet) and ‘location’; the price will be a ‘numerical value’ - this relates to regression. Similarly, if instead of predicting the price we try to predict classes, the price can be classified into labels such as ‘very costly’, ‘costly’, ‘affordable’, ‘cheap’, and ‘very cheap’ - this relates to classification. Each class may correspond to some range of values.
We will be exploring these algorithms in the coming tutorials.
Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:
$Accuracy = \frac{Number\ of\ correct\ predictions}{Total\ number\ of\ predictions}$
For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
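The formula translates directly into code; the four counts below are made-up numbers, purely for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 90 true positives, 850 true negatives, 10 false positives, 50 false negatives
print(accuracy(tp=90, tn=850, fp=10, fn=50))  # -> 0.94
```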
Accuracy alone doesn’t tell the full story when you’re working with a class-imbalanced data set, where there is a significant disparity between the number of positive and negative labels.
In many cases, accuracy is a poor or misleading metric.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
In a Receiver Operating Characteristic (ROC) curve, the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100-Specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
Each point is the TP rate and FP rate at one decision threshold.
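To make the threshold sweep concrete, here is a minimal sketch that computes one (TPR, FPR) point per threshold; the scores and labels are invented toy data:

```python
def roc_point(scores, labels, threshold):
    """Return (TPR, FPR) at one decision threshold; labels are 0/1."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp / (tp + fn), fp / (fp + tn)

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # model scores (toy data)
labels = [1, 1, 0, 1, 0, 0]              # true classes
for t in (0.2, 0.5, 0.8):
    tpr, fpr = roc_point(scores, labels, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Lowering the threshold moves the point up and to the right: more positives are caught (higher TPR) at the cost of more false alarms (higher FPR).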
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
| True Positives <br> We correctly called wolf! <br> We saved the town | False Positives <br> Error: we called wolf falsely <br> Everyone is mad at us |
|---|---|
| False Negatives <br> There was a wolf, but we didn’t spot it <br> It ate all our chickens | True Negatives <br> No wolf, no alarm <br> Everyone is fine |
Precision attempts to answer the following question:
What proportion of positive identifications was actually correct?
$Precision = \frac{TP}{TP + FP}$
Recall attempts to answer the following question:
What proportion of actual positives was identified correctly?
$Recall = \frac{TP}{TP + FN}$
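Both definitions translate directly into code; the counts below are invented for illustration:

```python
def precision(tp, fp):
    """Of all positive predictions, what fraction was actually positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, what fraction did we identify?"""
    return tp / (tp + fn)

# e.g. 8 true positives, 2 false positives, 8 false negatives
print(precision(tp=8, fp=2))  # -> 0.8
print(recall(tp=8, fn=8))     # -> 0.5
```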
AUC stands for “Area Under the ROC Curve”. That is, AUC measures the entire two-dimensional area underneath the ROC curve, from (0, 0) to (1, 1).
AUC is desirable for the following two reasons:
However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:
Prediction bias is a quantity that measures how far apart the average of predictions and average of observations are. That is:
$Prediction\ bias = average\ of\ predictions - average\ of\ labels\ in\ data\ set$
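That definition is a one-liner; the predictions and labels below are made-up values:

```python
def prediction_bias(predictions, labels):
    """Average of predictions minus average of labels in the data set."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)

preds = [0.9, 0.8, 0.6, 0.7]  # predicted probabilities (toy data)
labels = [1, 1, 0, 1]         # observed labels
print(prediction_bias(preds, labels))  # close to 0: well calibrated on average
```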