In last week’s article, I wrote about train-test splits. However, splitting the data only once has a drawback: because the split is a random sample, train and test performance can vary considerably depending on which rows land in each set. To get a reliable estimate, we must validate our model more than once. The K-Fold Cross Validation technique deals with this issue.
K-Fold Cross Validation
We separate the dataset into k slices (folds) of equal size and train and evaluate the model k times, each time with a different partition: one fold serves as the test set while the remaining k-1 folds form the train set.
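To make the partitioning concrete, here is a minimal sketch using scikit-learn’s KFold splitter on toy data (the 6-sample array and k=3 are illustrative assumptions, not values from the article):

```python
import numpy as np
from sklearn.model_selection import KFold

# 6 toy samples with 2 features each (illustrative assumption)
X = np.arange(12).reshape(6, 2)

# k=3: each round holds out 2 samples for testing, trains on the other 4
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
```

Across the three rounds, every sample appears in the test set exactly once.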
K-Fold Cross Validation in Scikit-Learn
Scikit-Learn offers many cross validation techniques.
Below is an example of applying k-fold cross validation to a linear regression model.
1. Separate the data into train and test splits:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
2. Instantiate, fit, and predict:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_predict_test = linreg.predict(X_test)
3. Apply 5-fold cross validation and find the mean:
import numpy as np
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score
mse = make_scorer(mean_squared_error)
fivefold_cv_results = np.mean(cross_val_score(linreg, X, y, cv=5, scoring=mse))
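Under the hood, 5-fold cross validation amounts to the following hand-rolled loop; a runnable sketch, where the synthetic data from make_regression is an assumption standing in for the article’s X and y:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (assumption: stands in for the article's X and y)
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh model on the 4 training folds, score on the held-out fold
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("per-fold MSE:", fold_mse)
print("mean MSE:", np.mean(fold_mse))
```

Averaging the per-fold scores gives a single estimate that is less sensitive to any one random split than a single train-test evaluation.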