Underfitting and Overfitting

Before moving ahead, I assume my readers already know the basics of ML. What I am going to do is write about all those topics that a newbie finds difficult to understand or to apply. Topics like overfitting, cross-validation, boosting, bias, variance, feature engineering etc. are the ones a newbie must understand properly to build an efficient model.

In this post, I am writing about the problem of Underfitting and Overfitting in classification and how to avoid them.

[Figure: an overly complex classification model whose boundary fits every training point exactly]
Let's first talk about overfitting. Suppose Theon Greyjoy has an exam ahead of him in Winterfell, and suppose he applies the strategy of memorizing all the previous years' questions from the same exam (a strategy most students follow) rather than understanding the concepts behind the theory. Now, during the exam, there are two possibilities: either he scores well or he doesn't. The only way he scores well is if the professor sets the questions straight from the previous years' papers, and even star kids as dim as Alia Bhatt and Tiger know that a professor would never do that. So, unfortunately, Theon is never going to get a decent score if he applies this strategy.
This whole scenario illustrates the term overfitting. Overfitting is a situation in classification problems where a model is built so that its priority is to fit the given input data exactly, no matter how complex the model becomes. You can see in the above figure how complex the resulting model is. If we use this model to predict the targets for the training data, it gives 100% accuracy. But if we use the same model to predict on new test data, we obviously get a lot of errors. The model is unable to make the generalisations needed to classify unseen data accurately.
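
To make this concrete, here is a minimal sketch in Python with scikit-learn (the moons dataset and the decision tree are my own choices, not from the post): an unconstrained tree memorises the training points exactly, much like Theon memorising old question papers, and then stumbles on data it has not seen.

```python
# Hedged sketch of overfitting: an unconstrained decision tree fits the
# training data perfectly but does noticeably worse on held-out data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: the tree keeps splitting until every training point is classified correctly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # 1.0 on the memorised data
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower on unseen data
```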

What is Underfitting then?
As the name suggests, underfitting is just the opposite of overfitting: it is the lack of a sufficiently complex model. Suppose you are learning where the border runs between two countries from labelled samples; if that border is anything but a straight line, a linear model is what will act as an underfit model in this situation. It will fail to predict the output well for new data, or even for the given training data.
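
The same idea in code (again my own toy example, assuming scikit-learn): if the true "border" is a circle, a straight-line classifier scores close to chance even on the data it was trained on.

```python
# Hedged sketch of underfitting: a linear model cannot separate a class
# nested inside another, so it does badly on train and test data alike.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", linear.score(X_train, y_train))  # close to chance (~0.5)
print("test accuracy: ", linear.score(X_test, y_test))    # also close to chance
```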

So the problem ahead of us is this: how do we avoid underfitting or overfitting?
The most basic method is to split the training set into halves. The first half is used to train the classifier, the second half is used to test the classifier (as we know what the correct answer is). The advantage of this is that we can quickly get a percentage accuracy for our classifier. The disadvantage is that we’re testing a half-trained classifier so it doesn’t give us a good indicator of how it will perform with the full training dataset.
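
As a rough sketch, the split-and-score idea looks like this with scikit-learn's train_test_split (the toy dataset and the 50/50 ratio follow the text; in practice splits like 80/20 are more common):

```python
# Hedged sketch of the "split the training set into halves" approach.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# First half trains the classifier, second half is held out to score it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("accuracy on the held-out half:", clf.score(X_test, y_test))
```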

According to an online source, the strategies to avoid these problems are:
  • collect more data
  • use ensembling methods that “average” models
  • choose simpler models / penalize complexity
For the first point, it may help to plot learning curves. If you see that more data keeps closing the gap between the training and validation scores, and you can afford to collect more data, then this would probably be the best choice.
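
A sketch of what plotting a learning curve might look like with scikit-learn (the dataset and classifier here are placeholders of my own):

```python
# Hedged sketch: plot training vs validation accuracy as the training
# set grows; if the two curves are still converging, more data may help.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```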

Overfitting can be a real problem if our model has too much capacity — too many model parameters to fit, and too many hyperparameters to tune. If the dataset is small, a simple model is always a good option to prevent overfitting.
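
The last two remedies from the list above can also be sketched in a few lines (again, the moons data and these particular models are my own placeholders): capping the depth of the decision tree penalizes complexity, and a random forest "averages" many trees.

```python
# Hedged sketch: a simpler (depth-limited) model and an ensemble both
# shrink the gap between training and test accuracy.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "unconstrained tree": DecisionTreeClassifier(random_state=0),
    "depth-limited tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: train={clf.score(X_train, y_train):.2f}, test={clf.score(X_test, y_test):.2f}")
```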

You can also go for cross-validation, which is another important ML topic and one I will explain in a separate post. Till then, best wishes and Happy Diwali!
