4.2 Spatial operations on vector data

This section provides an overview of spatial operations on vector geographic data represented as simple features in the sf package before Section 4.3, which presents spatial methods using the raster package.

4.2.1 Spatial subsetting

Spatial subsetting is the process of selecting features of a spatial object based on whether or not they in some way relate in space to another object.It is analogous to attribute subsetting (covered in Section 3.2.1) and can be done with the base R square bracket ([) operator or with the filter() function from the tidyverse.

An example of spatial subsetting is provided by the nz and nz_height datasets in spData.These contain projected data on the 16 main regions and 101 highest points in New Zealand, respectively (Figure 4.1).The following code chunk first creates an object representing Canterbury, then uses spatial subsetting to return all high points in the region:

  1. canterbury = nz %>% filter(Name == "Canterbury")
  2. canterbury_height = nz_height[canterbury, ]

![Illustration of spatial subsetting with red triangles representing 101 high points in New Zealand, clustered near the central Canterbuy region (left). The points in Canterbury were created with the [ subsetting operator (highlighted in gray, right).](/projects/Robinlovelace-geocompr/02fe2905da12dd7425a0fd8488bbd0d4.png)
Figure 4.1: Illustration of spatial subsetting with red triangles representing 101 high points in New Zealand, clustered near the central Canterbuy region (left). The points in Canterbury were created with the [ subsetting operator (highlighted in gray, right).

Like attribute subsetting x[y, ] subsets features of a target x using the contents of a source object y.Instead of y being of class logical or integer — a vector of TRUE and FALSE values or whole numbers — for spatial subsetting it is another spatial (sf) object.

Various topological relations can be used for spatial subsetting.These determine the type of spatial relationship that features in the target object must have with the subsetting object to be selected, including touches, crosses or within (see Section 4.2.2).Intersects is the default spatial subsetting operator, a default that returns TRUE for many types of spatial relations, including touches, crosses and is within.These alternative spatial operators can be specified with the op = argument, a third argument that can be passed to the [ operator for sf objects.This is demonstrated in the following command which returns the opposite of st_intersect(), points that do not intersect with Canterbury (see Section 4.2.2):

  1. nz_height[canterbury, , op = st_disjoint]

Note the empty argument — denoted with , , — in the preceding code chunk is included to highlight op, the third argument in [ for sf objects.One can use this to change the subsetting operation in many ways.nz_height[canterbury, 2, op = st_disjoint], for example, returns the same rows but only includes the second attribute column (see sf:::[.sf and the ?sf for details).

For many applications, this is all you’ll need to know about spatial subsetting for vector data.In this case, you can safely skip to Section 4.2.2.

If you’re interested in the details, including other ways of subsetting, read on.Another way of doing spatial subsetting uses objects returned by topological operators.This is demonstrated in the first command below:

  1. sel_sgbp = st_intersects(x = nz_height, y = canterbury)
  2. class(sel_sgbp)
  3. #> [1] "sgbp"
  4. sel_logical = lengths(sel_sgbp) > 0
  5. canterbury_height2 = nz_height[sel_logical, ]

In the above code chunk, an object of class sgbp (a sparse geometry binary predicate, a list of length x in the spatial operation) is created and then converted into a logical vector sellogical (containing only TRUE and FALSE values).The function lengths() identifies which features in nz_height intersect with _any objects in y.In this case 1 is the greatest possible value but for more complex operations one could use the method to subset only features that intersect with, for example, 2 or more features from the source object.

Note: another way to return a logical output is by setting sparse = FALSE (meaning ‘return a dense matrix not a sparse one’) in operators such as st_intersects(). The command st_intersects(x = nz_height, y = canterbury, sparse = FALSE)[, 1], for example, would return an output identical sel_logical.Note: the solution involving sgbp objects is more generalisable though, as it works for many-to-many operations and has lower memory requirements.

It should be noted that a logical can also be used with filter() as follows (sparse = FALSE is explained in Section 4.2.2):

  1. canterbury_height3 = nz_height %>%
  2. filter(st_intersects(x = ., y = canterbury, sparse = FALSE))

At this point, there are three versions of canterbury_height, one created with spatial subsetting directly and the other two via intermediary selection objects.To explore these objects and spatial subsetting in more detail, see the supplementary vignettes on subsetting and tidverse-pitfalls.

4.2.2 Topological relations

Topological relations describe the spatial relationships between objects.To understand them, it helps to have some simple test data to work with.Figure 4.2 contains a polygon (a), a line (l) and some points (p), which are created in the code below.

  1. # create a polygon
  2. a_poly = st_polygon(list(rbind(c(-1, -1), c(1, -1), c(1, 1), c(-1, -1))))
  3. a = st_sfc(a_poly)
  4. # create a line
  5. l_line = st_linestring(x = matrix(c(-1, -1, -0.5, 1), ncol = 2))
  6. l = st_sfc(l_line)
  7. # create points
  8. p_matrix = matrix(c(0.5, 1, -1, 0, 0, 1, 0.5, 1), ncol = 2)
  9. p_multi = st_multipoint(x = p_matrix)
  10. p = st_cast(st_sfc(p_multi), "POINT")

Points (p 1 to 4), line and polygon objects arranged to illustrate topological relations.
Figure 4.2: Points (p 1 to 4), line and polygon objects arranged to illustrate topological relations.

A simple query is: which of the points in p intersect in some way with polygon a?The question can be answered by inspection (points 1 and 2 are over or touch the triangle).It can also be answered by using a spatial predicate such as do the objects intersect?This is implemented in sf as follows:

  1. st_intersects(p, a)
  2. #> Sparse geometry binary ..., where the predicate was `intersects'
  3. #> 1: 1
  4. #> 2: 1
  5. #> 3: (empty)
  6. #> 4: (empty)

The contents of the result should be as you expected:the function returns a positive (1) result for the first two points, and a negative result (represented by an empty vector) for the last two.What may be unexpected is that the result comes in the form of a list of vectors.This sparse matrix output only registers a relation if one exists, reducing the memory requirements of topological operations on multi-feature objects.As we saw in the previous section, a dense matrix consisting of TRUE or FALSE values for each combination of features can also be returned when sparse = FALSE:

  1. st_intersects(p, a, sparse = FALSE)
  2. #> [,1]
  3. #> [1,] TRUE
  4. #> [2,] TRUE
  5. #> [3,] FALSE
  6. #> [4,] FALSE

The output is a matrix in which each row represents a feature in the target object and each column represents a feature in the selecting object.In this case, only the first two features in p intersect with a and there is only one feature in a so the result has only one column.The result can be used for subsetting as we saw in Section 4.2.1.

Note that stintersects() returns TRUE for the second feature in the object p even though it just touches the polygon a: _intersects is a ‘catch-all’ topological operation which identifies many types of spatial relation.

The opposite of st_intersects() is st_disjoint(), which returns only objects that do not spatially relate in any way to the selecting object (note [, 1] converts the result into a vector):

  1. st_disjoint(p, a, sparse = FALSE)[, 1]
  2. #> [1] FALSE FALSE TRUE TRUE

st_within() returns TRUE only for objects that are completely within the selecting object.This applies only to the first object, which is inside the triangular polygon, as illustrated below:

  1. st_within(p, a, sparse = FALSE)[, 1]
  2. #> [1] TRUE FALSE FALSE FALSE

Note that although the first point is within the triangle, it does not touch any part of its border.For this reason st_touches() only returns TRUE for the second point:

  1. st_touches(p, a, sparse = FALSE)[, 1]
  2. #> [1] FALSE TRUE FALSE FALSE

What about features that do not touch, but almost touch the selection object?These can be selected using st_is_within_distance(), which has an additional dist argument.It can be used to set how close target objects need to be before they are selected.Note that although point 4 is one unit of distance from the nearest node of a (at point 2 in Figure 4.2), it is still selected when the distance is set to 0.9.This is illustrated in the code chunk below, the second line of which converts the lengthy list output into a logical object:

  1. sel = st_is_within_distance(p, a, dist = 0.9) # can only return a sparse matrix
  2. lengths(sel) > 0
  3. #> [1] TRUE TRUE FALSE TRUE

Functions for calculating topological relations use spatial indices to largely speed up spatial query performance.They achieve that using the Sort-Tile-Recursive (STR) algorithm.The st_join function, mentioned in the next section, also uses the spatial indexing.You can learn more at https://www.r-spatial.org/r/2017/06/22/spatial-index.html.

4.2.3 Spatial joining

Joining two non-spatial datasets relies on a shared ‘key’ variable, as described in Section 3.2.3.Spatial data joining applies the same concept, but instead relies on shared areas of geographic space (it is also know as spatial overlay).As with attribute data, joining adds a new column to the target object (the argument x in joining functions), from a source object (y).

The process can be illustrated by an example.Imagine you have ten points randomly distributed across the Earth’s surface.Of the points that are on land, which countries are they in?Random points to demonstrate spatial joining are created as follows:

  1. set.seed(2018) # set seed for reproducibility
  2. (bb_world = st_bbox(world)) # the world's bounds
  3. #> xmin ymin xmax ymax
  4. #> -180.0 -90.0 180.0 83.6
  5. random_df = tibble(
  6. x = runif(n = 10, min = bb_world[1], max = bb_world[3]),
  7. y = runif(n = 10, min = bb_world[2], max = bb_world[4])
  8. )
  9. random_points = random_df %>%
  10. st_as_sf(coords = c("x", "y")) %>% # set coordinates
  11. st_set_crs(4326) # set geographic CRS

The scenario is illustrated in Figure 4.3.The random_points object (top left) has no attribute data, while the world (top right) does.The spatial join operation is done by st_join(), which adds the name_long variable to the points, resulting in random_joined which is illustrated in Figure 4.3 (bottom left — see 04-spatial-join.R).Before creating the joined dataset, we use spatial subsetting to create world_random, which contains only countries that contain random points, to verify the number of country names returned in the joined dataset should be four (see the top right panel of Figure 4.3).

  1. world_random = world[random_points, ]
  2. nrow(world_random)
  3. #> [1] 4
  4. random_joined = st_join(random_points, world["name_long"])

Illustration of a spatial join. A new attribute variable is added to random points (top left) from source world object (top right) resulting in the data represented in the final panel.
Figure 4.3: Illustration of a spatial join. A new attribute variable is added to random points (top left) from source world object (top right) resulting in the data represented in the final panel.

By default, st_join() performs a left join (see Section 3.2.3), but it can also do inner joins by setting the argument left = FALSE.Like spatial subsetting, the default topological operator used by st_join() is st_intersects().This can be changed with the join argument (see ?st_join for details).In the example above, we have added features of a polygon layer to a point layer.In other cases, we might want to join point attributes to a polygon layer.There might be occasions where more than one point falls inside one polygon.In such a case st_join() duplicates the polygon feature: it creates a new row for each match.

4.2.4 Non-overlapping joins

Sometimes two geographic datasets do not touch but still have a strong geographic relationship enabling joins.The datasets cycle_hire and cycle_hire_osm, already attached in the spData package, provide a good example.Plotting them shows that they are often closely related but they do not touch, as shown in Figure 4.4, a base version of which is created with the following code below:

  1. plot(st_geometry(cycle_hire), col = "blue")
  2. plot(st_geometry(cycle_hire_osm), add = TRUE, pch = 3, col = "red")

We can check if any points are the same st_intersects() as shown below:

  1. any(st_touches(cycle_hire, cycle_hire_osm, sparse = FALSE))
  2. #> [1] FALSE

Figure 4.4: The spatial distribution of cycle hire points in London based on official data (blue) and OpenStreetMap data (red).

Imagine that we need to join the capacity variable in cycle_hire_osm onto the official ‘target’ data contained in cycle_hire.This is when a non-overlapping join is needed.The simplest method is to use the topological operator st_is_within_distance() shown in Section 4.2.2, using a threshold distance of 20 m.Note that, before performing the relation, both objects are transformed into a projected CRS.These projected objects are created below (note the affix _P, short for projected):

  1. cycle_hire_P = st_transform(cycle_hire, 27700)
  2. cycle_hire_osm_P = st_transform(cycle_hire_osm, 27700)
  3. sel = st_is_within_distance(cycle_hire_P, cycle_hire_osm_P, dist = 20)
  4. summary(lengths(sel) > 0)
  5. #> Mode FALSE TRUE
  6. #> logical 304 438

This shows that there are 438 points in the target object cyclehire_P within the threshold distance of cycle_hire_osm_P.How to retrieve the _values associated with the respective cycle_hire_osm_P points?The solution is again with st_join(), but with an addition dist argument (set to 20 m below):

  1. z = st_join(cycle_hire_P, cycle_hire_osm_P, st_is_within_distance, dist = 20)
  2. nrow(cycle_hire)
  3. #> [1] 742
  4. nrow(z)
  5. #> [1] 762

Note that the number of rows in the joined result is greater than the target.This is because some cycle hire stations in cycle_hire_P have multiple matches in cycle_hire_osm_P.To aggregate the values for the overlapping points and return the mean, we can use the aggregation methods learned in Chapter 3, resulting in an object with the same number of rows as the target:

  1. z = z %>%
  2. group_by(id) %>%
  3. summarize(capacity = mean(capacity))
  4. nrow(z) == nrow(cycle_hire)
  5. #> [1] TRUE

The capacity of nearby stations can be verified by comparing a plot of the capacity of the source cycle_hire_osm data with the results in this new object (plots not shown):

  1. plot(cycle_hire_osm["capacity"])
  2. plot(z["capacity"])

The result of this join has used a spatial operation to change the attribute data associated with simple features; the geometry associated with each feature has remained unchanged.

4.2.5 Spatial data aggregation

Like attribute data aggregation, covered in Section 3.2.2, spatial data aggregation can be a way of condensing data.Aggregated data show some statistics about a variable (typically average or total) in relation to some kind of grouping variable.Section 3.2.2 demonstrated how aggregate() and group_by() %>% summarize() condense data based on attribute variables.This section demonstrates how the same functions work using spatial grouping variables.

Returning to the example of New Zealand, imagine you want to find out the average height of high points in each region.This is a good example of spatial aggregation: it is the geometry of the source (y or nz in this case) that defines how values in the target object (x or nz_height) are grouped.This is illustrated using the base aggregate() function below:

  1. nz_avheight = aggregate(x = nz_height, by = nz, FUN = mean)

The result of the previous command is an sf object with the same geometry as the (spatial) aggregating object (nz).17The result of the previous operation is illustrated in Figure 4.5.The same result can also be generated using the ‘tidy’ functions group_by() and summarize() (used in combination with st_join()):

Average height of the top 101 high points across the regions of New Zealand.
Figure 4.5: Average height of the top 101 high points across the regions of New Zealand.

  1. nz_avheight2 = nz %>%
  2. st_join(nz_height) %>%
  3. group_by(Name) %>%
  4. summarize(elevation = mean(elevation, na.rm = TRUE))

The resulting nz_avheight objects have the same geometry as the aggregating object nz but with a new column representing the mean average height of points within each region of New Zealand (other summary functions such as median() and sd() can be used in place of mean()).Note that regions containing no points have an associated elevation value of NA.For aggregating operations which also create new geometries, see Section 5.2.6.

Spatial congruence is an important concept related to spatial aggregation.An aggregating object (which we will refer to as y) is congruent with the target object (x) if the two objects have shared borders.Often this is the case for administrative boundary data, whereby larger units — such as Middle Layer Super Output Areas (MSOAs) in the UK or districts in many other European countries — are composed of many smaller units.

Incongruent aggregating objects, by contrast, do not share common borders with the target (Qiu, Zhang, and Zhou 2012).This is problematic for spatial aggregation (and other spatial operations) illustrated in Figure 4.6.Areal interpolation overcomes this issue by transferring values from one set of areal units to another.Algorithms developed for this task include area weighted and ‘pycnophylactic’ areal interpolation methods (Tobler 1979).

Illustration of congruent (left) and incongruent (right) areal units with respect to larger aggregating zones (translucent blue borders).
Figure 4.6: Illustration of congruent (left) and incongruent (right) areal units with respect to larger aggregating zones (translucent blue borders).

The spData package contains a dataset named incongruent (colored polygons with black borders in the right panel of Figure 4.6) and a dataset named aggregating_zones (the two polygons with the translucent blue border in the right panel of Figure 4.6).Let us assume that the value column of incongruent refers to the total regional income in million Euros.How can we transfer the values of the underlying nine spatial polygons into the two polygons of aggregating_zones?

The simplest useful method for this is area weighted spatial interpolation.In this case values from the incongruent object are allocated to the aggregating_zones in proportion to area; the larger the spatial intersection between input and output features, the larger the corresponding value.For instance, if one intersection of incongruent and aggregating_zones is 1.5 km2 but the whole incongruent polygon in question has 2 km2 and a total income of 4 million Euros, then the target aggregating zone will obtain three quarters of the income, in this case 3 million Euros.This is implemented in st_interpolate_aw(), as demonstrated in the code chunk below.

  1. agg_aw = st_interpolate_aw(incongruent[, "value"], aggregating_zones,
  2. extensive = TRUE)
  3. #> Warning in st_interpolate_aw.sf(incongruent[, "value"],
  4. #> aggregating_zones, : st_interpolate_aw assumes attributes are constant over
  5. #> areas of x
  6. # show the aggregated result
  7. agg_aw$value
  8. #> [1] 19.6 25.7

In our case it is meaningful to sum up the values of the intersections falling into the aggregating zones since total income is a so-called spatially extensive variable.This would be different for spatially intensive variables, which are independent of the spatial units used, such as income per head or percentages.In this case it is more meaningful to apply an average function when doing the aggregation instead of a sum function.To do so, one would only have to set the extensive parameter to FALSE.

4.2.6 Distance relations

While topological relations are binary — a feature either intersects with another or does not — distance relations are continuous.The distance between two objects is calculated with the st_distance() function.This is illustrated in the code chunk below, which finds the distance between the highest point in New Zealand and the geographic centroid of the Canterbury region, created in Section 4.2.1:

  1. nz_heighest = nz_height %>% top_n(n = 1, wt = elevation)
  2. canterbury_centroid = st_centroid(canterbury)
  3. st_distance(nz_heighest, canterbury_centroid)
  4. #> Units: [m]
  5. #> [,1]
  6. #> [1,] 115540

There are two potentially surprising things about the result:

  • It has units, telling us the distance is 100,000 meters, not 100,000 inches, or any other measure of distance.
  • It is returned as a matrix, even though the result only contains a single value.
    This second feature hints at another useful feature of stdistance(), its ability to return _distance matrices between all combinations of features in objects x and y.This is illustrated in the command below, which finds the distances between the first three features in nz_height and the Otago and Canterbury regions of New Zealand represented by the object co.
  1. co = filter(nz, grepl("Canter|Otag", Name))
  2. st_distance(nz_height[1:3, ], co)
  3. #> Units: [m]
  4. #> [,1] [,2]
  5. #> [1,] 123537 15498
  6. #> [2,] 94283 0
  7. #> [3,] 93019 0

Note that the distance between the second and third features in nzheight and the second feature in co is zero.This demonstrates the fact that distances between points and polygons refer to the distance to _any part of the polygon:The second and third points in nzheight are _in Otago, which can be verified by plotting them (result not shown):

  1. plot(st_geometry(co)[2])
  2. plot(st_geometry(nz_height)[2:3], add = TRUE)