Why Do We Split Datasets?
This article is a detailed explanation for the interview question below.
Why is it important to create a separate evaluation split of a dataset when performing model/algorithm tuning in supervised learning?
We create predictive models to predict outcomes for unseen data. To measure how a model performs on new instances, we keep part of the data “unseen” by the model.
What are train and test splits?
We randomly separate the dataset into two parts: a train split and a test split. We use the train split for training and the test split to measure model performance.
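The random separation above can be sketched in a few lines of plain Python. The helper name `train_test_split_simple` and the 80/20 ratio are illustrative choices, not part of the original text:

```python
import random

def train_test_split_simple(data, train_ratio=0.8, seed=42):
    """Randomly split a dataset into train and test parts (illustrative helper)."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = data[:]              # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

dataset = list(range(10))
train, test = train_test_split_simple(dataset)
print(len(train), len(test))  # 8 2
```

Shuffling before slicing matters: if the data is ordered (say, by class label), a plain slice would give the model a biased view of the problem.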
The data used to train the machine learning model is called the train data. We give the train data to the model and expect it to learn patterns from the examples it sees. The model is built on the data it discovers in the training dataset.
The unseen data used to evaluate the machine learning model is called the test data. Once the model is trained, we use the test data to check whether its predictions are right or wrong.
We judge the algorithm's performance on the test split of the data. We cannot use the training set to measure model performance because the model might have memorized it.
We compare the train and test set performances to detect underfitting and overfitting, and counter them with regularization and optimization. We improve the generalization power of our predictive algorithm by measuring its performance on the held-out test data, not the training data.
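A minimal sketch of this train-versus-test comparison, assuming scikit-learn is available (the synthetic dataset and the unconstrained decision tree are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem, split 80% train / 20% test
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree can memorize the training data
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

A large gap between the two scores (for example, near-perfect train accuracy with much lower test accuracy) signals overfitting; poor scores on both splits signal underfitting.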
Deciding Split Percentages
How big should the training and testing sets be?
There is no strict rule for deciding the sizes. A common choice is a training split of 67–80% and a testing split of 20–33%.
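The sizes these percentages imply can be checked with simple arithmetic; the dataset size of 1000 below is an arbitrary example:

```python
n = 1000  # hypothetical dataset size
for test_ratio in (0.20, 0.33):
    n_test = int(n * test_ratio)
    n_train = n - n_test
    print(f"test={test_ratio:.0%}: {n_train} train / {n_test} test")
```

Larger test splits give a more reliable performance estimate, while larger train splits give the model more examples to learn from; the right balance depends on how much data is available.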