
Cross Validation

K-Fold Cross Validation Explained…

Seyma Tas
Mar 24, 2021


In last week’s article, I wrote about train-test splits. However, there is a problem with separating the data into only two splits: since we create random samples of the data, the train and test performances can differ considerably depending on the particular split we drew. We need to validate our model more than once. We use the K-Fold Cross Validation technique to deal with this issue.

K-Fold Cross Validation

We separate the dataset into k slices of equal size and train and test the model k times, once per partition. In each iteration, one slice serves as the test set and the remaining k-1 slices form the training set.
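
To make the slicing concrete, here is a minimal sketch of one way to generate the k partitions, using scikit-learn’s KFold splitter (introduced in the next section) and assuming X and y are NumPy arrays:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    # In each iteration, k-1 slices train the model and 1 slice tests it
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]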

K-Fold Cross Validation in Scikit-Learn

Scikit-Learn offers a lot of cross validation techniques.
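
For example, besides plain k-fold, the sklearn.model_selection module also ships stratified, shuffled, leave-one-out, and time-series variants:

from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    ShuffleSplit,
    LeaveOneOut,
    TimeSeriesSplit,
)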

Below is an example of how k-fold cross validation is applied to a linear regression model.

1. Separate the data into train and test splits:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

2. Instantiate, fit, and predict:

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

linreg.fit(X_train, y_train)
y_predict_test = linreg.predict(X_test)
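
As a quick sanity check (this step is not required for the cross validation below), the holdout predictions can be scored with mean squared error:

from sklearn.metrics import mean_squared_error

test_mse = mean_squared_error(y_test, y_predict_test)
print(f"Holdout MSE: {test_mse:.3f}")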

3. Apply 5-fold cross validation and find the mean:

import numpy as np
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score

mse = make_scorer(mean_squared_error)

cv_results_5fold = np.mean(cross_val_score(linreg, X, y, cv=5, scoring=mse))
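
Note that scikit-learn also exposes this metric as a built-in scoring string. Its built-in scorers follow a “higher is better” convention, so the MSE version is negated and has to be flipped back:

# Equivalent, using the built-in (negated) scorer string
neg_mse_scores = cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error")
mean_mse = -np.mean(neg_mse_scores)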
