Geocomputation with R, Chapter 11: Statistical learning

11.4 Introduction to (spatial) cross-validation

Cross-validation belongs to the family of resampling methods (James et al. 2013). The basic idea is to split a dataset (repeatedly) into training and test sets, whereby the training data is used to fit a model which is then applied to the test set. Comparing the predicted values with the known response values from the test set (using a performance measure such as the AUROC in the binomial case) gives a bias-reduced assessment of the model’s capability to generalize the learned relationship to independent data. For example, a 100-repeated 5-fold cross-validation means randomly splitting the data into five partitions (folds), with each fold being used once as a test set (see upper row of Figure 11.3). This guarantees that each observation is used exactly once in one of the test sets, and requires the fitting of five models. Subsequently, this procedure is repeated 100 times. Of course, the data splitting will differ in each repetition. Overall, this sums up to 500 models, and the mean performance measure (AUROC) of all models is taken as the model’s overall predictive power.
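
The bookkeeping behind a 100-repeated 5-fold cross-validation can be sketched in a few lines. The following is a minimal, language-agnostic illustration written in Python (standard library only), not the implementation used in this chapter. The function names `repeated_kfold` and `cross_validate`, the `score_fn` callback (which would fit a model on the training indices and return, for example, the AUROC on the test indices) and the dataset size of 200 are all assumptions made for the example.

```python
import random

def repeated_kfold(n_obs, k=5, repeats=100, seed=42):
    """Yield (repetition, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        idx = list(range(n_obs))
        rng.shuffle(idx)                        # new random partition per repetition
        folds = [idx[i::k] for i in range(k)]   # k roughly equal-sized folds
        for f, test_idx in enumerate(folds):
            # all observations outside the current fold form the training set
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx

def cross_validate(score_fn, n_obs, k=5, repeats=100, seed=42):
    """Average a hypothetical score_fn(train_idx, test_idx) over all splits."""
    scores = [score_fn(train, test)
              for _, _, train, test in repeated_kfold(n_obs, k, repeats, seed)]
    return sum(scores) / len(scores), len(scores)

# Hypothetical dataset size, chosen only for illustration
splits = list(repeated_kfold(n_obs=200))
n_models = len(splits)  # 100 repetitions x 5 folds
```

Each of the `n_models` splits corresponds to one fitted model, and within a single repetition every observation index appears in exactly one test set. The overall performance estimate is simply the mean of the per-split scores returned by `cross_validate`.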

However, geographic data is special. As we will see in Chapter 12, the ‘first law’ of geography states that points close to each other are, generally, more similar than points further away (Miller 2004). This means that nearby points are not statistically independent, which is a problem because training and test points in conventional CV are often too close to each other (see upper row of Figure 11.3). ‘Training’ observations near the ‘test’ observations can provide a kind of ‘sneak preview’: information that should be unavailable to the training dataset. To alleviate this problem, ‘spatial partitioning’ is used to split the observations into spatially disjoint subsets (using the observations’ coordinates in a k-means clustering; Brenning 2012b; lower row of Figure 11.3). This partitioning strategy is the only difference between spatial and conventional CV. As a result, spatial CV leads to a bias-reduced assessment of a model’s predictive performance, and hence helps to avoid overfitting.
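
To make the idea of spatial partitioning concrete, the sketch below clusters observation coordinates with a plain Lloyd-style k-means and uses each cluster as one test fold. It is an illustrative Python example (standard library only) under stated assumptions, not the partitioning code of any particular package: the function name `kmeans_folds`, the synthetic uniform coordinates and the fixed seeds are all invented for the example.

```python
import random

def kmeans_folds(coords, k=5, n_iter=50, seed=1):
    """Assign each (x, y) point to one of k spatial folds via Lloyd's k-means."""
    rng = random.Random(seed)
    centers = rng.sample(coords, k)   # start from k randomly chosen observations
    labels = [0] * len(coords)
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        labels = [
            min(range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in coords
        ]
        # update step: move each center to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(coords, labels) if lab == c]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center as is
        if new_centers == centers:              # converged
            break
        centers = new_centers
    return labels

# Synthetic coordinates, for illustration only
rng = random.Random(0)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]

fold_of = kmeans_folds(coords, k=5)
spatial_splits = []
for f in range(5):
    test_idx = [i for i, lab in enumerate(fold_of) if lab == f]
    train_idx = [i for i, lab in enumerate(fold_of) if lab != f]
    spatial_splits.append((train_idx, test_idx))
```

Because every test fold is a contiguous cluster in coordinate space, most test observations have no training neighbours in their immediate vicinity, which is what removes the ‘sneak preview’. Everything downstream (fitting on `train_idx`, scoring on `test_idx`, averaging) is unchanged from conventional CV. In a repeated spatial CV, one plausible way to obtain different partitions per repetition is to vary the k-means starting seed.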

Figure 11.3: Spatial visualization of selected test and training observations for cross-validation of one repetition. Random (upper row) and spatial partitioning (lower row).