11.3 Conventional modeling approach in R

Before introducing the mlr package, an umbrella package providing a unified interface to dozens of learning algorithms (Section 11.5), it is worth taking a look at the conventional modeling interface in R. This introduction to supervised statistical learning provides the basis for doing spatial CV, and contributes to a better grasp of the mlr approach presented subsequently.

Supervised learning involves predicting a response variable as a function of predictors (Section 11.4). In R, modeling functions are usually specified using formulas (see ?formula and the detailed Formulas in R Tutorial). The following command specifies and runs a generalized linear model:

    fit = glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
              family = binomial(),
              data = lsl)

It is worth understanding each of the three input arguments:

  • A formula, which specifies landslide occurrence (lslpts) as a function of the predictors.
  • A family, which specifies the type of model, in this case binomial because the response is binary (see ?family).
  • The data frame which contains the response and the predictors.

The results of this model can be printed as follows (summary(fit) provides a more detailed account of the results):

    class(fit)
    #> [1] "glm" "lm"
    fit
    #>
    #> Call:  glm(formula = lslpts ~ slope + cplan + cprof + elev + log10_carea,
    #>     family = binomial(), data = lsl)
    #>
    #> Coefficients:
    #> (Intercept)        slope        cplan        cprof         elev
    #>    1.97e+00     9.30e-02    -2.57e+01    -1.43e+01     2.41e-05
    #> log10_carea
    #>   -2.12e+00
    #>
    #> Degrees of Freedom: 349 Total (i.e. Null);  344 Residual
    #> Null Deviance:      485
    #> Residual Deviance: 361     AIC: 373
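
Beyond printing, the components of a fitted glm object can be extracted programmatically. The following is a minimal sketch on simulated data; the data frame d, the object fit_sim and the coefficient values used in the simulation are assumptions made for illustration and are not the lsl dataset.

```r
# Simulated stand-in for lsl (hypothetical data, not the landslide dataset)
set.seed(1)
n = 200
d = data.frame(slope = runif(n, 0, 60), elev = runif(n, 1700, 3100))
d$y = rbinom(n, size = 1, prob = plogis(-3 + 0.08 * d$slope))

fit_sim = glm(y ~ slope + elev, family = binomial(), data = d)

coef(fit_sim)                  # named vector of estimated coefficients
formula(fit_sim)               # the formula the model was fitted with
all.vars(formula(fit_sim))     # names of the response and the predictors
summary(fit_sim)$coefficients  # estimates, standard errors, z values, p-values
```

The same accessor functions should work on the fit object above, since they operate on any object of class glm.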

The model object fit, of class glm, contains the coefficients defining the fitted relationship between response and predictors. It can also be used for prediction with the generic predict() function, which in this case dispatches to the method predict.glm(). Setting type to "response" returns the predicted probabilities (of landslide occurrence) for each observation in lsl, as illustrated below (see ?predict.glm):

    pred_glm = predict(object = fit, type = "response")
    head(pred_glm)
    #>      1      2      3      4      5      6
    #> 0.3327 0.4755 0.0995 0.1480 0.3486 0.6766
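
To make the role of the type argument concrete, the sketch below compares predictions on the link scale (log-odds, the default of predict.glm()) with predictions on the response scale. It uses simulated data; d and fit_sim are hypothetical names and do not refer to the landslide model.

```r
# Simulated binary response (hypothetical data for illustration only)
set.seed(1)
n = 200
d = data.frame(slope = runif(n, 0, 60))
d$y = rbinom(n, size = 1, prob = plogis(-3 + 0.08 * d$slope))
fit_sim = glm(y ~ slope, family = binomial(), data = d)

eta = predict(fit_sim, type = "link")      # linear predictor (log-odds), the default
p   = predict(fit_sim, type = "response")  # predicted probabilities

# The inverse logit maps the log-odds to probabilities
all.equal(unname(plogis(eta)), unname(p))
# Without newdata, response-scale predictions are the fitted values
all.equal(unname(p), unname(fitted(fit_sim)))
range(p)  # probabilities lie between 0 and 1
```

Forgetting type = "response" is a common source of confusion with binomial models, because the default output consists of log-odds that are not bounded by 0 and 1.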

Spatial predictions can be made by applying the coefficients to the predictor rasters. This can be done manually or with raster::predict(). In addition to a model object (fit), this function also expects a raster stack with the predictors named as in the model’s input data frame (Figure 11.2).

    # making the prediction; ta is the raster stack holding the predictors
    pred = raster::predict(ta, model = fit, type = "response")
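
The manual route mentioned above can be sketched as follows. The example assumes simulated data and uses a plain data frame (cells) as a stand-in for raster cell values, so no raster package is needed; d, fit_sim and cells are hypothetical names. It multiplies the design matrix by the coefficients and applies the inverse logit, then checks the result against predict().

```r
# Simulated training data (hypothetical, not the lsl dataset)
set.seed(1)
n = 200
d = data.frame(slope = runif(n, 0, 60), cplan = rnorm(n, 0, 0.02))
d$y = rbinom(n, size = 1, prob = plogis(-3 + 0.08 * d$slope - 20 * d$cplan))
fit_sim = glm(y ~ slope + cplan, family = binomial(), data = d)

# Stand-in for raster cells: one row per cell, columns named like the predictors
cells = expand.grid(slope = seq(0, 60, by = 10), cplan = c(-0.02, 0, 0.02))

# Manual prediction: design matrix times coefficients, then the inverse logit
X = model.matrix(~ slope + cplan, data = cells)
p_manual = plogis(drop(X %*% coef(fit_sim)))

# The same prediction via predict() with newdata
p_pred = predict(fit_sim, newdata = cells, type = "response")
all.equal(unname(p_manual), unname(p_pred))
```

The column names of cells must match the predictor names in the formula, which mirrors the naming requirement for the raster stack.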

Figure 11.2: Spatial prediction of landslide susceptibility using a GLM.

Here, when making predictions we neglect spatial autocorrelation, since we assume that on average the predictive accuracy remains the same with or without spatial autocorrelation structures. However, it is possible to incorporate spatial autocorrelation structures into models (Zuur et al. 2009, 2017; Blangiardo and Cameletti 2015) as well as into predictions (kriging approaches; see, e.g., Goovaerts 1997; Hengl 2007; Bivand, Pebesma, and Gómez-Rubio 2013). This is, however, beyond the scope of this book.

Spatial prediction maps are one very important outcome of a model. Even more important is how good the underlying model is at making them, since a prediction map is useless if the model’s predictive performance is bad. The most popular measure to assess the predictive performance of a binomial model is the Area Under the Receiver Operating Characteristic Curve (AUROC). This value typically lies between 0.5 and 1.0, with 0.5 indicating a model that is no better than random and 1.0 indicating perfect prediction of the two classes. Thus, the higher the AUROC, the better the model’s predictive power. The following code chunk computes the AUROC of the model: roc() takes the response and the predicted values as inputs, and auc() returns the area under the resulting curve.

    pROC::auc(pROC::roc(lsl$lslpts, fitted(fit)))
    #> Area under the curve: 0.826
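
To illustrate what the AUROC measures, the following sketch computes it in base R from ranks. It relies on the interpretation of the AUROC as the probability that a randomly chosen positive case receives a higher predicted value than a randomly chosen negative case, with ties counting one half. The function auroc_rank() and the toy values are hypothetical; this is not how pROC implements the calculation, and it assumes a logical response in which TRUE marks the positive class.

```r
# Rank-based AUROC (hypothetical helper, assumes TRUE = positive class)
auroc_rank = function(response, score) {
  stopifnot(is.logical(response), length(response) == length(score))
  n_pos = sum(response)
  n_neg = sum(!response)
  r = rank(score)  # average ranks, so ties contribute one half
  (sum(r[response]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example with made-up values
response = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
score    = c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)
auroc_rank(response, score)
#> [1] 0.8888889
```

In the toy example, eight of the nine positive/negative pairs are ordered correctly, which gives 8/9. The result should agree with pROC::auc() when higher predicted values indicate the positive class.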

An AUROC value of 0.83 represents a good fit. However, this is an overoptimistic estimate since we have computed it on the complete dataset, that is, on the same data that was used to fit the model. To derive a bias-reduced assessment, we have to use cross-validation, and in the case of spatial data should make use of spatial CV.
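
The idea of cross-validation can be sketched in base R as follows. This is a minimal, non-spatial example on simulated data: observations are assigned to folds at random, whereas spatial CV would partition them by location. All names (d, fold, auroc_rank(), auroc_cv) and the simulated coefficients are assumptions made for illustration.

```r
# Simulated data (hypothetical, not the lsl dataset)
set.seed(42)
n = 300
d = data.frame(slope = runif(n, 0, 60), elev = runif(n, 1700, 3100))
d$y = rbinom(n, size = 1, prob = plogis(-3 + 0.08 * d$slope))

# Rank-based AUROC (hypothetical helper, assumes TRUE = positive class)
auroc_rank = function(response, score) {
  n_pos = sum(response)
  n_neg = sum(!response)
  (sum(rank(score)[response]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

k = 5
fold = sample(rep(seq_len(k), length.out = n))  # random, non-spatial folds

# Fit on k - 1 folds, evaluate on the held-out fold
auroc_cv = sapply(seq_len(k), function(i) {
  train = d[fold != i, ]
  test  = d[fold == i, ]
  fit_i = glm(y ~ slope + elev, family = binomial(), data = train)
  auroc_rank(test$y == 1, predict(fit_i, newdata = test, type = "response"))
})

fit_all = glm(y ~ slope + elev, family = binomial(), data = d)
auroc_rank(d$y == 1, fitted(fit_all))  # estimate on the training data
mean(auroc_cv)                         # cross-validated estimate
```

Each observation is predicted by a model that has not seen it during fitting, which is what reduces the optimism of the estimate.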