Day 6 of 100 Days of AI

Yesterday, I skimmed through k-fold cross-validation. I wanted to return to it today because I still wasn’t sure what exactly we are assessing with it, and why.

Here’s the clearest YouTube video I found on the topic. The screenshot below from that video provides an easy summary.

Key takeaways:

  • K-fold cross-validation is a method of evaluating the performance of a specific model configuration.
    • A model configuration includes the choice of algorithm (e.g. linear regression, decision trees, KNNs, neural networks), feature selection (i.e. the independent variables used to predict the dependent variable), and hyperparameters (e.g. the number of neurons in a neural network).
  • To perform k-fold cross-validation, you:
    • (1) split the data into ‘k’ equal parts (the “folds”);
    • (2) train k models using your chosen model configuration across the k folds; for example, in the image above, model 1 is trained on folds 2-5 and tested on fold 1; model 2 is trained on folds 1 and 3-5 and tested on fold 2; and so forth;
    • (3) assess each model’s performance on its held-out fold and aggregate these scores, usually with an average (or median, depending on the data type);
    • (4) this aggregate score tells you how good your model configuration is (see the code sketch after this list).
  • If your aggregated accuracy is strong, you can then train your final model on the whole dataset.
  • K-fold cross-validation is especially useful when you have a small amount of data and therefore a small test set. In those situations a single train/test split gives you only one limited opportunity to evaluate your model; k-fold lets every data point serve as test data exactly once, so you can assess a model configuration across a larger number of scenarios.
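
To make the steps concrete, here’s a minimal sketch in Python using scikit-learn. The dataset (scikit-learn’s built-in breast cancer data), the model configuration (a depth-limited decision tree), k=5, and the accuracy metric are my own illustrative assumptions, not from the video.

```python
# A minimal sketch of k-fold cross-validation with scikit-learn.
# The dataset, model configuration (decision tree, max_depth=3),
# and k=5 are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# (1) Split the data into k equal parts (the "folds").
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# (2) Train one model per fold: each model trains on k-1 folds
#     and is tested on the one fold it never saw.
scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(max_depth=3)  # the "model configuration"
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], preds))

# (3)-(4) Aggregate the per-fold scores, usually with an average.
print("per-fold accuracy:", np.round(scores, 3))
print(f"mean accuracy: {np.mean(scores):.3f}")

# If the aggregate score looks strong, train the final model on all the data.
final_model = DecisionTreeClassifier(max_depth=3).fit(X, y)
```

In practice, scikit-learn’s `cross_val_score(model, X, y, cv=5)` wraps steps (1)-(3) in a single call; the manual loop above just makes each step visible.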