What is Model Validation?
Model validation is a technique to estimate model performance on unseen data. There are two common approaches: hold-out and cross-validation.
For hold-out validation, the training data is split into a training set and a validation set; the validation set plays a role similar to a test set. The other approach is K-fold cross-validation, which does not set aside a single fixed validation set but uses the entire dataset. The data is divided into K groups (folds), where K is the number of folds. The model is trained on K-1 groups and evaluated on the remaining group. This is repeated until every group has been used once to evaluate the model, and the average of the K scores is the validation score.
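The K-fold procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how Driverless AI implements it: the helper names (`k_fold_indices`, `cross_validate`), the toy "predict the training mean" model, and the synthetic data are all assumptions made for the example.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Train on k-1 folds, evaluate on the held-out fold, repeat k times."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_scores = []
    for held_out in range(k):
        valid_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_scores.append(
            score(model, [xs[i] for i in valid_idx], [ys[i] for i in valid_idx])
        )
    # The validation score is the average of the per-fold scores.
    return fold_scores, mean(fold_scores)


# Toy model (assumption for illustration): always predict the training mean.
def fit_mean(train_x, train_y):
    return mean(train_y)


def mean_abs_error(model, valid_x, valid_y):
    return mean(abs(y - model) for y in valid_y)


# Synthetic data, made up for the example.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_scores, cv_score = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error)
print(fold_scores, cv_score)
```

Every row lands in exactly one validation fold, so each row is used for evaluation once and for training K-1 times.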
In practice, K-fold cross-validation is a more robust approach than hold-out. It estimates the model performance without having to set aside part of the data purely for validation. It also reduces the risk that the estimate depends on one particular validation split (a single split might be a “lucky” one, especially for imbalanced data). Good values for K are around 5 to 10. Comparing the validation metrics across the K folds is a common way to check the stability of the model performance.
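One simple way to compare the per-fold metrics is to look at their mean and spread. The sketch below is an assumption about how such a check might look; the function name `summarize_folds` and the score values are made up for illustration and are not results from any real model.

```python
from statistics import mean, stdev


def summarize_folds(fold_scores):
    """Return the mean, standard deviation, and relative spread of fold scores."""
    avg = mean(fold_scores)
    spread = stdev(fold_scores)
    relative = spread / avg if avg else float("inf")
    return {"mean": avg, "std": spread, "relative_spread": relative}


# Hypothetical per-fold scores (made-up values, K = 5).
example_scores = [0.81, 0.79, 0.83, 0.80, 0.82]
summary = summarize_folds(example_scores)
print(summary)
```

A small spread relative to the mean suggests that performance does not depend much on which rows end up in which fold; a large spread is a hint to look more closely at the data or the split.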
In Driverless AI, with cross-validation, the whole dataset is utilized, and each model is trained on a different subset of the training data. The following visualization shows an example of cross-validation with 5 folds.
Driverless AI randomly splits the data into the specified number of folds for cross-validation. Users can also use their own pre-defined cross-validation split with the fold column.
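To illustrate the idea of a pre-defined split, the sketch below turns a column of fold labels into train/validation index sets, using each distinct label as one validation fold. This is a generic illustration under that assumption, not the Driverless AI API; how Driverless AI maps fold column values to folds may differ. The function name `splits_from_fold_column` and the labels are hypothetical.

```python
from collections import defaultdict


def splits_from_fold_column(fold_values):
    """Build (fold_id, train_rows, valid_rows) tuples from a fold label per row."""
    rows_by_fold = defaultdict(list)
    for row, fold_id in enumerate(fold_values):
        rows_by_fold[fold_id].append(row)

    splits = []
    for fold_id in sorted(rows_by_fold):
        valid_idx = rows_by_fold[fold_id]
        # All rows whose label differs from fold_id form the training set.
        train_idx = sorted(
            r for other, rows in rows_by_fold.items() if other != fold_id for r in rows
        )
        splits.append((fold_id, train_idx, valid_idx))
    return splits


# Hypothetical fold column: one label per row.
fold_column = ["a", "b", "a", "c", "b", "c"]
splits = splits_from_fold_column(fold_column)
for fold_id, train_idx, valid_idx in splits:
    print(fold_id, train_idx, valid_idx)
```

Rows that share a label always stay together in the same fold, which is useful when related rows (for example, records from the same customer) should not be split between training and validation.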