Cross-validation is safe to use
To the Editor: The importance of machine learning (ML) to science is now widely recognized. For example, in its last editorial of the last decade [1], Nature named ML as the 'breakthrough' of the decade: "Few fields are untouched by the machine-learning revolution, from materials science to drug exploration; quantum physics to medicine." Despite the importance of ML, some basic ML ideas are still poorly understood in the general science community.
One such technique is cross-validation [2,3]. Within ML, cross-validation is so common that it is rare to find a paper that doesn't use it. However, non-ML scientists commonly misunderstand cross-validation and avoid it because they think it is unsafe. Two typical concerns are quoted below.
"Papers using machine learning must contain a dedicated subsection clearly describing the composition of the training dataset. This should include information on the preparation of the cross-validation sets and of an independent test set that is not used in the training process. For data originating from biological sequences the description must furthermore address how homology between sequences is taken into account to ensure that the training and independent test sets do not have identical or near identical examples. Papers using leave-one-out will be editorially rejected unless there is a special circumstance in which it can be argued that this procedure is meaningful for the problem addressed in the paper. Machine learning papers must report the performance on an independent test set. It is not sufficient to report the average error over the individual cross-validation sets." (From the scope guidelines of the journal Bioinformatics [4].)
"N-fold cross validation is not a very tough test, or even the way the models are used, as QSAR (quantitative structure activity relation) models are most valuable if they can predict future compounds/activities. So with cross validation I am concerned there is leakage and independent reviewers may feel the same way, unless you can show them this is not a concern. Independent test sets are a more robust way of assessing the model, selected by date order, which is assessing the model ability to predict the future." (From a recent private communication with a drug design scientist, slightly edited for sense.)
Both quoted statements seem to imply that cross-validation is unsafe and should be replaced by the alternative technique of train/test. This is very puzzling to ML researchers and statisticians, as cross-validation and train/test share exactly the same set of assumptions. It is therefore unreasonable to permit one technique and not the other. The main use of cross-validation and train/test is the same: to predict how well a predictive ML model will perform on new independent data from the same distribution.
The idea of train/test is to use one sample of data (the training data) to learn a predictive ML model. A second sample of data (the test data, whose true classifications are known but are not told to the predictor) is then used to estimate the error rate of the predictive ML model. Note that there is a loss of efficiency here, as we do not use the full sample to train the ML model. The idea of cross-validation is to divide the data into subsamples. Each subsample is predicted using the ML model learnt from the remaining subsamples, and the estimated error rate is the average error rate from these subsamples [2,3,5]. The ML model finally used is calculated from all the data. Cross-validation gives a better estimate of the error rate than train/test at the cost of more computation. The 'leave-one-out' method of Lachenbruch and Mickey [6] is cross-validation with the number of subsamples equal to the number of examples, so that each subsample contains a single example.
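
The two procedures described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the synthetic two-class data, the toy nearest-mean classifier and all function names (fit_nearest_mean, train_test_error, cross_validation_error) are hypothetical and are not taken from any of the cited works.

```python
import random


def fit_nearest_mean(train):
    """Learn a toy classifier: predict the class whose training mean is closest."""
    means = {}
    for label in {y for _, y in train}:
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def error_rate(predict, sample):
    """Fraction of examples in `sample` that `predict` misclassifies."""
    return sum(predict(x) != y for x, y in sample) / len(sample)


def train_test_error(data, test_fraction=0.25, seed=0):
    """Train/test: learn on one sample, estimate the error on a held-out sample."""
    shuffled = random.Random(seed).sample(data, len(data))
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    return error_rate(fit_nearest_mean(train), test)


def cross_validation_error(data, k=5, seed=0):
    """Cross-validation: each of k subsamples is predicted by a model learnt
    from the other k - 1; the estimate is the average of the k error rates.
    Requires 2 <= k <= len(data); k == len(data) gives leave-one-out."""
    shuffled = random.Random(seed).sample(data, len(data))
    folds = [shuffled[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        errors.append(error_rate(fit_nearest_mean(train), held_out))
    return sum(errors) / k


# Assumed synthetic data: one feature, two overlapping Gaussian classes.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]

holdout_estimate = train_test_error(data)
cv_estimate = cross_validation_error(data, k=5)
loo_estimate = cross_validation_error(data, k=len(data))

# The model that is finally used is learnt from all the data.
final_model = fit_nearest_mean(data)
```

Both functions draw their samples at random from the same data, which is where the shared assumption enters: either estimate is only meaningful if future data come from the same distribution.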
Given that cross-validation and train/test do the same job, and make the same assumptions, what could possibly be the reason for concerns about the use of cross-validation? The clue seems to be that both quotes refer to structure in the data: in the Bioinformatics journal case, that structure is the possible homologous relationship between examples; and in the drug design example, the structure is the temporal relationship between examples. Recall that cross-validation and train/test are used to predict how well a predictive ML model will perform on new independent data from the same distribution. This means that if the cross-validation or train/test samples are drawn from a distribution that differs from that of future data, then the prediction of performance will be inaccurate. However, this problem is exactly the same for cross-validation and train/test. It is therefore irrational to trust cross-validation less than train/test.
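
The point that structure in the data affects both techniques in the same way can be illustrated with a small simulation. The sketch below rests on loud assumptions: the data are hypothetical clusters of near-identical examples (a stand-in for homologous sequences), each cluster has a randomly chosen label so that nothing generalizes to a new cluster, and the classifier is a simple 1-nearest-neighbour rule. All names are illustrative.

```python
import random


def make_grouped_data(n_groups=100, copies=5, seed=0):
    """Hypothetical data: each group is a cluster of near-identical examples
    that share one randomly chosen label. Examples are (x, label, group)."""
    rng = random.Random(seed)
    data = []
    for g in range(n_groups):
        label = rng.randint(0, 1)
        data += [(10.0 * g + rng.gauss(0.0, 0.01), label, g) for _ in range(copies)]
    return data


def one_nn_error(train, test):
    """Error rate of a 1-nearest-neighbour classifier on `test`."""
    wrong = 0
    for x, y, _ in test:
        nearest = min(train, key=lambda ex: abs(ex[0] - x))
        wrong += nearest[1] != y
    return wrong / len(test)


def split_units(data, by_group):
    """Units assigned to samples: single examples, or whole groups."""
    if not by_group:
        return [[ex] for ex in data]
    groups = {}
    for ex in data:
        groups.setdefault(ex[2], []).append(ex)
    return list(groups.values())


def train_test_error(data, by_group, test_fraction=0.2, seed=0):
    """Train/test estimate, splitting either examples or whole groups."""
    units = split_units(data, by_group)
    random.Random(seed).shuffle(units)
    n_test = max(1, int(len(units) * test_fraction))
    test = [ex for unit in units[:n_test] for ex in unit]
    train = [ex for unit in units[n_test:] for ex in unit]
    return one_nn_error(train, test)


def cross_validation_error(data, by_group, k=5, seed=0):
    """k-fold cross-validation estimate, with the same choice of units."""
    units = split_units(data, by_group)
    random.Random(seed).shuffle(units)
    folds = [[ex for unit in units[i::k] for ex in unit] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        errors.append(one_nn_error(train, held_out))
    return sum(errors) / k


data = make_grouped_data()
for by_group in (False, True):
    print("split by group:", by_group,
          "| train/test:", train_test_error(data, by_group),
          "| cross-validation:", cross_validation_error(data, by_group))
```

In this constructed setting, splitting individual examples at random lets near-duplicates fall on both sides of the split, so train/test and cross-validation should both report an optimistic error rate. Keeping whole groups together should bring both estimates close to the chance level that would be seen on genuinely new groups. The remedy lies in how the samples are selected, not in which of the two techniques is used.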
In conclusion, ML is now a key technology in modern science. However, its techniques need to be better understood. We therefore call for a dialogue between ML researchers and domain scientists in which ML methods, such as cross-validation, can be explained to domain scientists so that they can trust and benefit from them.