K-fold Cross-Validation🌈🌈

Jyothi Panuganti
5 min read · Mar 8, 2020

Getting to know k-fold cross-validation, with a few fun little emojis

What is cross-validation?🌈

  • Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set.
  • Validating a model checks that it performs well on new data and helps in selecting the best model, parameters, and accuracy metrics.

📢 The main use of cross-validation is to estimate how accurately a predictive model will perform in practice: the model is given a dataset of known data on which training is run (the training dataset) and a dataset of unseen data against which the model is tested (the test dataset).

👁️‍🗨️The main aim of validation is to flag problems like overfitting or selection bias.

Uses of K-fold Cross-Validation😉:

✳️K-fold cross-validation is a procedure used to estimate the skill of a model on new data.

There are commonly used variations on cross-validation, such as stratified and repeated, that are available in scikit-learn.

✳️Cross-validation is primarily used in machine learning to estimate the skill of a model on unseen data, and it can also be applied to unbalanced datasets.

✳️The k-fold procedure has a single parameter called k that refers to the number of groups a given data sample is to be split into. When a specific value for k is chosen, it may be used in place of k in the name of the method, such as k=10 becoming 10-fold cross-validation.

✳️The k-fold procedure is easy to understand and apply, and it generally results in a less biased or less optimistic estimate of model skill than other methods, such as a simple train/test split.

The general procedure is as follows:😓

1. Shuffle the dataset randomly.

2. Split the dataset into k groups.

3. For each unique group:

i). Take the group as the hold-out or test dataset

ii). Take the remaining groups as the training dataset

iii). Fit a model on the training set and evaluate it on the test set

iv). Retain the evaluation score and discard the model

4. Summarize the skill of the model using the sample of model evaluation scores.
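
As a rough sketch, this procedure can be written with scikit-learn's KFold class; the toy dataset and the logistic regression model below are just illustrative assumptions, not something prescribed by the procedure itself.

```python
# A rough sketch of the k-fold procedure with scikit-learn's KFold.
# The dataset and model are toy assumptions used only for illustration.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

np.random.seed(1)
X = np.random.rand(20, 3)        # 20 samples, 3 features (toy data)
y = np.tile([0, 1], 10)          # alternating binary labels (toy data)

# Steps 1 and 2: shuffle the data and split it into k groups
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []

# Step 3: each group takes a turn as the hold-out test set
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])              # iii) fit on the training folds
    preds = model.predict(X[test_idx])                 # evaluate on the held-out fold
    scores.append(accuracy_score(y[test_idx], preds))  # iv) retain the score, discard the model

# Step 4: summarize the skill of the model using the sample of scores
print("Scores per fold:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (np.mean(scores), np.std(scores)))
```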

Configuration of k😋

👁️‍🗨️A poorly chosen value of k may result in a misrepresentative idea of the model's skill, such as a score with high variance or high bias.👁️‍🗨️

Three important considerations when choosing a value of k are mentioned below:

✳️ Representative: The value for k is chosen such that each train/test group of the data sample is large enough to be statistically representative of the broader dataset.

✳️k = 10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.

✳️k = n: The value for k is fixed to n, where n is the size of the dataset, to give each test sample an opportunity to be used in the hold-out dataset. This approach is called leave-one-out cross-validation.
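
For a quick sketch of these configurations in scikit-learn (assumed usage; the article itself shows no code), both k = 10 and leave-one-out have ready-made splitter classes:

```python
# Sketch: configuring k with scikit-learn splitter classes (assumed usage)
from sklearn.model_selection import KFold, LeaveOneOut

# k = 10: the common default found to give low bias and modest variance
kfold_10 = KFold(n_splits=10, shuffle=True, random_state=1)

# k = n: leave-one-out cross-validation, where every single observation
# gets its own turn as the hold-out test set
loo = LeaveOneOut()
```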

Worked Example🧐

👁️‍🗨️To make the cross-validation procedure concrete, let's look at an example.

Imagine we have a data sample with 6 observations:

  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

The first step is to pick a value for k in order to determine the number of folds used to split the data. Here, we will use a value of k=3. That means we will shuffle the data and then split the data into 3 groups. Because we have 6 observations, each group will have an equal number of 2 observations.

For example:

1. Fold1: [0.5, 0.2]

2. Fold2: [0.1, 0.3]

3. Fold3: [0.4, 0.6]

We can then make use of the sample, such as to evaluate the skill of a machine learning algorithm.

Three models are trained and evaluated with each fold given a chance to be the held-out test set.

For example:☺️😊

  • Model1: Trained on Fold1 + Fold2, Tested on Fold3
  • Model2: Trained on Fold2 + Fold3, Tested on Fold1
  • Model3: Trained on Fold1 + Fold3, Tested on Fold2

The models are then discarded after they are evaluated as they have served their purpose.

The skill scores are collected for each model and summarized for use.
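
A minimal sketch of this worked example with scikit-learn's KFold is below; note that the exact grouping depends on the random shuffle, so the printed folds may differ from the ones listed above.

```python
# Sketch of the worked example: 6 observations split into 3 folds
import numpy as np
from sklearn.model_selection import KFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for i, (train_idx, test_idx) in enumerate(kfold.split(data), start=1):
    # each fold plays the test set once; the other two folds form the training set
    print("Model%d: train=%s, test=%s" % (i, data[train_idx], data[test_idx]))
```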

Variations on Cross-Validation 😲

There are a number of variations on the k-fold cross-validation procedure.

Three commonly used variations are as follows:

✳️Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single train/test split is created to evaluate the model.

✳️Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.

✳️Repeated: This is where the k-fold cross-validation procedure is repeated n times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
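
Each of these variations has a ready-made helper in scikit-learn; the following is a brief, assumed-usage sketch with toy data.

```python
# Sketch: common variations on k-fold cross-validation in scikit-learn (assumed usage)
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, RepeatedKFold

X = np.random.rand(30, 2)   # toy data
y = np.tile([0, 1], 15)     # toy binary class labels

# Train/Test Split: a single split, roughly the k = 2 extreme
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# Stratified: each fold keeps the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Repeated: the whole k-fold procedure is repeated n times, reshuffling each time
rkf = RepeatedKFold(n_splits=3, n_repeats=5, random_state=1)
```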


Hey Everyone! 😍

📖Happy Learning!! 📖

Spread the love❤️ around you

