Handling of Spatial Data

Introduction

Spatial data differs from non-spatial data in that each observation carries spatial reference information. This information is usually stored as coordinates, often named x and y. Coordinates are stored either in a projected system such as UTM (Universal Transverse Mercator) or in latitude/longitude format.

Treating spatial data sets like non-spatial ones leads to overoptimistic estimates of the predictive accuracy of models (Brenning 2005). This is due to the underlying spatial autocorrelation in the data: the closer observations are located to each other, the more similar they tend to be. Spatial autocorrelation occurs in all spatial data sets; its magnitude varies depending on the characteristics of the data set.
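The effect of spatial autocorrelation can be made visible with a small simulation. The following is a minimal sketch in base R, not part of mlr: the coordinates, the smooth response surface and the noise level are all assumptions chosen for illustration. It compares how much the response differs between each observation and its nearest neighbor versus a randomly chosen observation.

```r
set.seed(1)
n = 400

# Hypothetical coordinates in the unit square
coords = data.frame(x = runif(n), y = runif(n))

# Hypothetical response: a smooth spatial surface plus random noise
z = sin(2 * pi * coords$x) + cos(2 * pi * coords$y) + rnorm(n, sd = 0.3)

# Pairwise distances; exclude each point as its own neighbor
d = as.matrix(dist(coords))
diag(d) = Inf
nn = apply(d, 1, which.min)

# Mean absolute difference in the response
diff.near = mean(abs(z - z[nn]))           # nearest neighbors
diff.random = mean(abs(z - z[sample(n)]))  # randomly paired observations

c(nearest = diff.near, random = diff.random)
```

Under these assumptions, neighboring observations differ far less than randomly paired ones, which is the similarity that a randomly drawn test set inherits from the training set.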

Common validation procedures such as cross-validation assume that the observations are independent in order to provide unbiased estimates. In the spatial case, this assumption is violated due to spatial autocorrelation. Consequently, non-spatial cross-validation fails to provide accurate performance estimates for such data sets.

Nested Spatial and Non-Spatial Validation

With random sampling of the data set (i.e., non-spatial sampling), training and test observations are often located directly next to each other in geographical space. Hence, the test set contains observations that are, due to spatial autocorrelation, similar to observations in the training set. As a result, the model trained on the training set performs quite well on the test data because it already knows it to some degree.

To reduce this bias in the resulting predictive accuracy estimate, Brenning (2005) suggested using spatial partitioning instead of random partitioning (see Figure 1). Here, the number of spatial clusters equals the number of folds chosen. These spatially disjoint subsets of the data introduce a spatial distance between training and test set. This reduces the influence of spatial autocorrelation and, in turn, the overoptimism of the predictive accuracy estimates. The example in Figure 1 shows a five-fold nested cross-validation setting and illustrates the difference between spatial and non-spatial partitioning. The nested approach is used when hyperparameter tuning is performed.
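The difference between the two partitioning schemes can be sketched in base R. This example assumes that the spatial clusters are obtained by k-means clustering of the coordinates; that choice, the simulated coordinates and the helper nearest.train.dist are assumptions made for illustration. For each observation, it measures the distance to the nearest observation that would end up in the training set when its own fold is held out.

```r
set.seed(42)
n = 500
k = 5  # number of folds = number of spatial clusters

# Hypothetical projected coordinates in metres
coords = data.frame(x = runif(n, 0, 1000), y = runif(n, 0, 1000))

# Random (non-spatial) fold assignment
fold.random = sample(rep(seq_len(k), length.out = n))

# Spatial fold assignment: one k-means cluster per fold (assumed method)
fold.spatial = kmeans(coords, centers = k, nstart = 10, iter.max = 50)$cluster

d = as.matrix(dist(coords))

# Hypothetical helper: median distance from each test observation
# to its nearest training observation
nearest.train.dist = function(fold) {
  per.point = vapply(seq_len(n), function(i) {
    min(d[i, fold != fold[i]])
  }, numeric(1))
  median(per.point)
}

dist.random = nearest.train.dist(fold.random)
dist.spatial = nearest.train.dist(fold.spatial)

c(random = dist.random, spatial = dist.spatial)
```

With random folds, almost every test observation has a training observation right next to it. With spatial folds, only observations near a cluster boundary do, so the typical distance between test and training data is considerably larger.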

How to use spatial partitioning in mlr

Spatial partitioning can be used when performing cross-validation. In any resample call you can choose SpCV or SpRepCV as the resampling strategy. SpCV performs a spatial cross-validation with a single repetition, whereas SpRepCV lets you choose any number of repetitions. As a rule of thumb, 100 repetitions are used to reduce the variance introduced by partitioning.
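The two resampling descriptions could be set up as in the following sketch. It assumes that SpCV takes the number of folds through iters and that SpRepCV takes folds and reps, mirroring the non-spatial CV and RepCV strategies; please check the makeResampleDesc help page of your mlr version. The mlr calls are only executed if the package is installed.

```r
folds = 5
reps = 100

# A repeated spatial CV fits one model per fold and repetition
total.iters = folds * reps

if (requireNamespace("mlr", quietly = TRUE)) {
  # Single spatial cross-validation (argument name iters is an assumption)
  rdesc.single = mlr::makeResampleDesc("SpCV", iters = folds)

  # Repeated spatial cross-validation following the rule of thumb above
  rdesc.repeated = mlr::makeResampleDesc("SpRepCV", folds = folds, reps = reps)

  print(rdesc.single)
  print(rdesc.repeated)
}
```

Note that 100 repetitions of a five-fold cross-validation require 500 model fits, so the examples further down use fewer repetitions to keep the runtime short.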

There are some prerequisites for this:

When specifying the task, you need to provide spatial coordinates through the argument coordinates. The supplied data.frame needs to have the same number of rows as the data, and at least two dimensions need to be supplied. This means that 3-D partitioning might also work, but we have not tested this explicitly. Coordinates need to be numeric; using a UTM projection is suggested. If these prerequisites are met, the coordinates are used for spatial partitioning whenever SpCV or SpRepCV is selected as the resampling strategy.
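The prerequisites can be checked before the task is created. The sketch below uses a made-up data set: the column names, value ranges and the task id "toy" are hypothetical, and the coordinates are merely UTM-like numbers. The task itself is only created if mlr is installed.

```r
set.seed(7)
n = 100

# Hypothetical data: two predictors and a binary target
dat = data.frame(
  slope = runif(n, 0, 60),
  elevation = runif(n, 1500, 3500),
  slides = factor(rep(c("TRUE", "FALSE"), length.out = n))
)

# Hypothetical UTM-like coordinates in metres, one row per observation
coords = data.frame(
  x = runif(n, 712000, 716000),
  y = runif(n, 9557000, 9561000)
)

# Prerequisites: data.frame, same number of rows, >= 2 numeric dimensions
coords.ok = is.data.frame(coords) &&
  nrow(coords) == nrow(dat) &&
  ncol(coords) >= 2 &&
  all(vapply(coords, is.numeric, logical(1)))

if (coords.ok && requireNamespace("mlr", quietly = TRUE)) {
  task = mlr::makeClassifTask(id = "toy", data = dat, target = "slides",
    positive = "TRUE", coordinates = coords)
  print(task)
}
```

Keeping the coordinates in a separate data.frame, rather than as columns of the data, means they are not used as predictors by the learner.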

Examples

The spatial.task data set serves as an example data set for spatial modeling tasks in mlr. The task argument coordinates is a data.frame with two coordinates that will later be used for the spatial partitioning of the data set.

In this example, the "Random Forest" algorithm (package ranger) is used to model a binary response variable.

For performance assessment, a repeated spatial cross-validation with 5 folds and 5 repetitions is chosen.

Spatial Cross-Validation

data("spatial.task")
spatial.task
#> Supervised task: ecuador
#> Type: classif
#> Target: slides
#> Observations: 751
#> Features:
#>    numerics     factors     ordered functionals 
#>          10           0           0           0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Has coordinates: TRUE
#> Classes: 2
#> FALSE  TRUE 
#>   251   500 
#> Positive class: TRUE

learner.rf = makeLearner("classif.ranger", predict.type = "prob")

resampling = makeResampleDesc("SpRepCV", folds = 5, reps = 5)

set.seed(123)
out = resample(learner = learner.rf, task = spatial.task,
  resampling = resampling, measures = list(auc))
#> Resampling: repeated spatial cross-validation
#> Measures:             auc
#> [Resample] iter 1:    0.6750295
#> [Resample] iter 2:    0.5390011
#> [Resample] iter 3:    0.7126168
#> [Resample] iter 4:    0.7179386
#> [Resample] iter 5:    0.4481074
#> [Resample] iter 6:    0.6661747
#> [Resample] iter 7:    0.5231959
#> [Resample] iter 8:    0.7140047
#> [Resample] iter 9:    0.7230064
#> [Resample] iter 10:   0.4368742
#> [Resample] iter 11:   0.6793610
#> [Resample] iter 12:   0.6912488
#> [Resample] iter 13:   0.5822304
#> [Resample] iter 14:   0.8494331
#> [Resample] iter 15:   0.6187330
#> [Resample] iter 16:   0.7044860
#> [Resample] iter 17:   0.6549153
#> [Resample] iter 18:   0.6441667
#> [Resample] iter 19:   0.5603976
#> [Resample] iter 20:   0.5034305
#> [Resample] iter 21:   0.6673554
#> [Resample] iter 22:   0.4422466
#> [Resample] iter 23:   0.7228560
#> [Resample] iter 24:   0.7178289
#> [Resample] iter 25:   0.5420875
#> 
#> Aggregated Result: auc.test.mean=0.6294690
#> 

mean(out$measures.test$auc)
#> [1] 0.629469

We can check for the bias introduced by spatial autocorrelation by performing the same modeling task with a non-spatial partitioning setting. To do so, we simply choose RepCV instead of SpRepCV. There is no need to remove the coordinates from the task, as they are only used if SpCV or SpRepCV is selected as the resampling strategy.

Non-Spatial Cross-Validation

learner.rf = makeLearner("classif.ranger", predict.type = "prob")

resampling = makeResampleDesc("RepCV", folds = 5, reps = 5)

set.seed(123)
out = resample(learner = learner.rf, task = spatial.task,
  resampling = resampling, measures = list(auc))
#> Resampling: repeated cross-validation
#> Measures:             auc
#> [Resample] iter 1:    0.7362637
#> [Resample] iter 2:    0.7651485
#> [Resample] iter 3:    0.8119974
#> [Resample] iter 4:    0.6807436
#> [Resample] iter 5:    0.7441992
#> [Resample] iter 6:    0.7769076
#> [Resample] iter 7:    0.8383069
#> [Resample] iter 8:    0.7276557
#> [Resample] iter 9:    0.7341821
#> [Resample] iter 10:   0.7563874
#> [Resample] iter 11:   0.7542955
#> [Resample] iter 12:   0.7972408
#> [Resample] iter 13:   0.7444000
#> [Resample] iter 14:   0.7083333
#> [Resample] iter 15:   0.7405476
#> [Resample] iter 16:   0.7296736
#> [Resample] iter 17:   0.7212000
#> [Resample] iter 18:   0.7715856
#> [Resample] iter 19:   0.7898859
#> [Resample] iter 20:   0.7915686
#> [Resample] iter 21:   0.7391717
#> [Resample] iter 22:   0.7454031
#> [Resample] iter 23:   0.8092320
#> [Resample] iter 24:   0.7267977
#> [Resample] iter 25:   0.7643745
#> 
#> Aggregated Result: auc.test.mean=0.7562201
#> 

mean(out$measures.test$auc)
#> [1] 0.7562201

The bias in performance introduced by spatial autocorrelation is around 0.13 AUROC in this example (0.756 with non-spatial versus 0.629 with spatial cross-validation).
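The size of the bias follows directly from the two aggregated results reported above. The variable names in this short calculation are chosen for illustration only:

```r
# Aggregated mean AUROC values taken from the two resample outputs above
auc.spatial = 0.6294690
auc.nonspatial = 0.7562201

# Overoptimism of the non-spatial estimate
bias = auc.nonspatial - auc.spatial
round(bias, 3)
#> [1] 0.127
```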

Notes

Complete code listing

The above code without the output is given below:

library("mlr") 
data("spatial.task") 
spatial.task 

learner.rf = makeLearner("classif.ranger", predict.type = "prob") 

resampling = makeResampleDesc("SpRepCV", folds = 5, reps = 5)

set.seed(123) 
out = resample(learner = learner.rf, task = spatial.task, 
  resampling = resampling, measures = list(auc)) 

mean(out$measures.test$auc)

learner.rf = makeLearner("classif.ranger", predict.type = "prob")

resampling = makeResampleDesc("RepCV", folds = 5, reps = 5)

set.seed(123) 
out = resample(learner = learner.rf, task = spatial.task, 
  resampling = resampling, measures = list(auc)) 

mean(out$measures.test$auc)