Handling of Spatial Data
Introduction
Spatial data is different from nonspatial data by having a spatial reference information attached to each observation.
This information is usually stored as coordinates, often named x
and y
.
Coordinates are either stored in UTM (Universal Transverse Mercator) or latitude/longitude format.
Treating spatial data sets like nonspatial ones leads to overoptimistic results in predictive accuracy of models (Brenning 2005). This is due to the underlying spatial autocorrelation in the data. Spatial autocorrelation does occur in all spatial data sets. Magnitude varies depending on the characteristics of the data set. The closer observations are located to each other, the more similar they are.
If common validation procedures like crossvalidation are applied to such data sets, they assume independence of the observation upfront to provide unbiased estimates. However, this assumption is violated in the spatial case due to spatial autocorrelation. Subsequently, nonspatial crossvalidation will fail to provide accurate performance estimates.
By doing a random sampling of the data set (i.e., nonspatial sampling), training and test set data are often located directly next to each other (in geographical space). Hence, the test set will contain observations which are somewhat similar (due to spatial autocorrelation) to observations in the training set. This leads to the effect that the model, which was trained on the training set, performs quite well on the test data because it already knows it to some degree.
To reduce this bias on the resulting predictive accuracy estimate, Brenning 2005 suggested using spatial partitioning in favor of random partitioning (see Figure 1). Here, spatial clusters are equal to the number of folds chosen. These spatially disjoint subsets of the data introduce a spatial distance between training and test set. This reduces the influence of spatial autocorrelation and subsequently also the overoptimistic predictive accuracy estimates. The example in Figure 1 shows a fivefold nested crossvalidation setting and exhibits the difference between spatial and nonspatial partitioning. The nested approach is used when hyperparameter tuning is performed.
How to use spatial partitioning in mlr
Spatial partitioning can be used when performing crossvalidation.
In any resample call you can choose SpCV
or SpRepCV
to use it.
While SpCV
will perform a spatial crossvalidation with only one repetition, SpRepCV
gives you the option to choose any number of repetitions.
As a rule of thumb, usually 100 repetitions are used with the aim to reduce variance introduced by partitioning.
There are some prerequisites for this:
When specifying the task, you need to provide spatial coordinates through argument coordinates
.
The supplied data.frame
needs to have the same number of rows as the data and at least two dimensions need to be supplied.
This means that a 3D partitioning might also work but we have not tested this explicitly.
Coordinates need to be numeric so it is suggested to use a UTM projection.
If this applies, the coordinates will be used for spatial partitioning if SpCV
or SpRepCV
is selected as resampling strategy.
Examples
The spatial.task
data set serves as an example data set for spatial modeling tasks in mlr
.
The task argument coordinates
is a data.frame
with two coordinates that will later be used for the spatial partitioning of the data set.
In this example, the "Random Forest" algorithm (package ranger) is used to model a binomial response variable.
For performance assessment, a repeated spatial crossvalidation with 5 folds and 5 repetitions is chosen.
Spatial CrossValidation
data("spatial.task")
spatial.task
#> Supervised task: ecuador
#> Type: classif
#> Target: slides
#> Observations: 751
#> Features:
#> numerics factors ordered functionals
#> 10 0 0 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Has coordinates: TRUE
#> Classes: 2
#> FALSE TRUE
#> 251 500
#> Positive class: TRUE
learner.rf = makeLearner("classif.ranger", predict.type = "prob")
resampling = makeResampleDesc("SpRepCV", fold = 5, reps = 5)
set.seed(123)
out = resample(learner = learner.rf, task = spatial.task,
resampling = resampling, measures = list(auc))
#> Resampling: repeated spatial crossvalidation
#> Measures: auc
#> [Resample] iter 1: 0.6750295
#> [Resample] iter 2: 0.5390011
#> [Resample] iter 3: 0.7126168
#> [Resample] iter 4: 0.7179386
#> [Resample] iter 5: 0.4481074
#> [Resample] iter 6: 0.6661747
#> [Resample] iter 7: 0.5231959
#> [Resample] iter 8: 0.7140047
#> [Resample] iter 9: 0.7230064
#> [Resample] iter 10: 0.4368742
#> [Resample] iter 11: 0.6793610
#> [Resample] iter 12: 0.6912488
#> [Resample] iter 13: 0.5822304
#> [Resample] iter 14: 0.8494331
#> [Resample] iter 15: 0.6187330
#> [Resample] iter 16: 0.7044860
#> [Resample] iter 17: 0.6549153
#> [Resample] iter 18: 0.6441667
#> [Resample] iter 19: 0.5603976
#> [Resample] iter 20: 0.5034305
#> [Resample] iter 21: 0.6673554
#> [Resample] iter 22: 0.4422466
#> [Resample] iter 23: 0.7228560
#> [Resample] iter 24: 0.7178289
#> [Resample] iter 25: 0.5420875
#>
#> Aggregated Result: auc.test.mean=0.6294690
#>
mean(out$measures.test$auc)
#> [1] 0.629469
We can check for the introduced spatial autocorrelation bias here by performing the same modeling task using a nonspatial partitioning setting.
To do so, we simply choose RepCV
instead of SpRepCV
.
There is no need to remove coordinates
from the task as it is only used if SpCV
or SpRepCV
is selected as the resampling strategy.
NonSpatial CrossValidation
learner.rf = makeLearner("classif.ranger", predict.type = "prob")
resampling = makeResampleDesc("RepCV", fold = 5, reps = 5)
set.seed(123)
out = resample(learner = learner.rf, task = spatial.task,
resampling = resampling, measures = list(auc))
#> Resampling: repeated crossvalidation
#> Measures: auc
#> [Resample] iter 1: 0.7362637
#> [Resample] iter 2: 0.7651485
#> [Resample] iter 3: 0.8119974
#> [Resample] iter 4: 0.6807436
#> [Resample] iter 5: 0.7441992
#> [Resample] iter 6: 0.7769076
#> [Resample] iter 7: 0.8383069
#> [Resample] iter 8: 0.7276557
#> [Resample] iter 9: 0.7341821
#> [Resample] iter 10: 0.7563874
#> [Resample] iter 11: 0.7542955
#> [Resample] iter 12: 0.7972408
#> [Resample] iter 13: 0.7444000
#> [Resample] iter 14: 0.7083333
#> [Resample] iter 15: 0.7405476
#> [Resample] iter 16: 0.7296736
#> [Resample] iter 17: 0.7212000
#> [Resample] iter 18: 0.7715856
#> [Resample] iter 19: 0.7898859
#> [Resample] iter 20: 0.7915686
#> [Resample] iter 21: 0.7391717
#> [Resample] iter 22: 0.7454031
#> [Resample] iter 23: 0.8092320
#> [Resample] iter 24: 0.7267977
#> [Resample] iter 25: 0.7643745
#>
#> Aggregated Result: auc.test.mean=0.7562201
#>
mean(out$measures.test$auc)
#> [1] 0.7562201
The introduced bias (caused by spatial autocorrelation) in performance in this example is around 0.12 AUROC.
Notes

Some models are more affected by spatial autocorrelation than others. In general, it can be said that the more flexible a model is, the more it will profit from underlying spatial autocorrelation. Simpler models (e.g., GLM) will show less overoptimistic performance estimates.

The concept of spatial crossvalidation was originally implemented in package sperrorest. This package comes with even more partitioning options and the ability to visualize the spatial grouping of folds. We plan to integrate more functions from sperrorest into
mlr
so stay tuned! 
For more detailed information, see Brenning 2005 and Brenning2012.
Complete code listing
The above code without the output is given below:
library("mlr")
data("spatial.task")
spatial.task
learner.rf = makeLearner("classif.ranger", predict.type = "prob")
resampling = makeResampleDesc("SpRepCV", fold = 5, reps = 5)
set.seed(123)
out = resample(learner = learner.rf, task = spatial.task,
resampling = resampling, measures = list(auc))
mean(out$measures.test$auc)
learner.rf = makeLearner("classif.ranger", predict.type = "prob")
resampling = makeResampleDesc("RepCV", fold = 5, reps = 5)
set.seed(123)
out = resample(learner = learner.rf, task = spatial.task,
resampling = resampling, measures = list(auc))
mean(out$measures.test$auc)