5.2 Geometric operations on vector data

This section is about operations that in some way change the geometry of vector (sf) objects.It is more advanced than the spatial data operations presented in the previous chapter (in Section 4.2), because here we drill down into the geometry:the functions discussed in this section work on objects of class sfc in addition to objects of class sf.

5.2.1 Simplification

Simplification is a process for generalization of vector objects (lines and polygons) usually for use in smaller scale maps.Another reason for simplifying objects is to reduce the amount of memory, disk space and network bandwidth they consume:it may be wise to simplify complex geometries before publishing them as interactive maps.The sf package provides st_simplify(), which uses the GEOS implementation of the Douglas-Peucker algorithm to reduce the vertex count.st_simplify() uses the dTolerance to control the level of generalization in map units (see Douglas and Peucker 1973 for details).Figure 5.1 illustrates simplification of a LINESTRING geometry representing the river Seine and tributaries.The simplified geometry was created by the following command:

  1. seine_simp = st_simplify(seine, dTolerance = 2000) # 2000 m

Comparison of the original and simplified geometry of the seine object.
Figure 5.1: Comparison of the original and simplified geometry of the seine object.

The resulting seine_simp object is a copy of the original seine but with fewer vertices.This is apparent, with the result being visually simpler (Figure 5.1, right) and consuming less memory than the original object, as verified below:

  1. object.size(seine)
  2. #> 17304 bytes
  3. object.size(seine_simp)
  4. #> 8320 bytes

Simplification is also applicable for polygons.This is illustrated using us_states, representing the contiguous United States.As we show in Chapter 6, GEOS assumes that the data is in a projected CRS and this could lead to unexpected results when using a geographic CRS.Therefore, the first step is to project the data into some adequate projected CRS, such as US National Atlas Equal Area (epsg = 2163) (on the left in Figure 5.2):

  1. us_states2163 = st_transform(us_states, 2163)

st_simplify() works equally well with projected polygons:

  1. us_states_simp1 = st_simplify(us_states2163, dTolerance = 100000) # 100 km

A limitation with st_simplify() is that it simplifies objects on a per-geometry basis.This means the ‘topology’ is lost, resulting in overlapping and ‘holy’ areal units illustrated in Figure 5.2 (middle panel).ms_simplify() from rmapshaper provides an alternative that overcomes this issue.By default it uses the Visvalingam algorithm, which overcomes some limitations of the Douglas-Peucker algorithm (Visvalingam and Whyatt 1993).The following code chunk uses this function to simplify us_states2163.The result has only 1% of the vertices of the input (set using the argument keep) but its number of objects remains intact because we set keep_shapes = TRUE:19

  1. # proportion of points to retain (0-1; default 0.05)
  2. us_states2163$AREA = as.numeric(us_states2163$AREA)
  3. us_states_simp2 = rmapshaper::ms_simplify(us_states2163, keep = 0.01,
  4. keep_shapes = TRUE)

Finally, the visual comparison of the original dataset and the two simplified versions shows differences between the Douglas-Peucker (st_simplify) and Visvalingam (ms_simplify) algorithm outputs (Figure 5.2):

Polygon simplification in action, comparing the original geometry of the contiguous United States with simplified versions, generated with functions from sf (center) and rmapshaper (right) packages.
Figure 5.2: Polygon simplification in action, comparing the original geometry of the contiguous United States with simplified versions, generated with functions from sf (center) and rmapshaper (right) packages.

5.2.2 Centroids

Centroid operations identify the center of geographic objects.Like statistical measures of central tendency (including mean and median definitions of ‘average’), there are many ways to define the geographic center of an object.All of them create single point representations of more complex vector objects.

The most commonly used centroid operation is the geographic centroid.This type of centroid operation (often referred to as ‘the centroid’) represents the center of mass in a spatial object (think of balancing a plate on your finger).Geographic centroids have many uses, for example to create a simple point representation of complex geometries, or to estimate distances between polygons.They can be calculated with the sf function st_centroid() as demonstrated in the code below, which generates the geographic centroids of regions in New Zealand and tributaries to the River Seine, illustrated with black points in Figure 5.3.

  1. nz_centroid = st_centroid(nz)
  2. seine_centroid = st_centroid(seine)

Sometimes the geographic centroid falls outside the boundaries of their parent objects (think of a doughnut).In such cases point on surface operations can be used to guarantee the point will be in the parent object (e.g., for labeling irregular multipolygon objects such as island states), as illustrated by the red points in Figure 5.3.Notice that these red points always lie on their parent objects.They were created with st_point_on_surface() as follows:20

  1. nz_pos = st_point_on_surface(nz)
  2. seine_pos = st_point_on_surface(seine)

Centroids (black points) and 'points on surface' (red points) of New Zealand's regions (left) and the Seine (right) datasets.
Figure 5.3: Centroids (black points) and ‘points on surface’ (red points) of New Zealand’s regions (left) and the Seine (right) datasets.

Other types of centroids exist, including the Chebyshev center and the visual center.We will not explore these here but it is possible to calculate them using R, as we’ll see in Chapter 10.

5.2.3 Buffers

Buffers are polygons representing the area within a given distance of a geometric feature:regardless of whether the input is a point, line or polygon, the output is a polygon.Unlike simplification (which is often used for visualization and reducing file size) buffering tends to be used for geographic data analysis.How many points are within a given distance of this line?Which demographic groups are within travel distance of this new shop?These kinds of questions can be answered and visualized by creating buffers around the geographic entities of interest.

Figure 5.4 illustrates buffers of different sizes (5 and 50 km) surrounding the river Seine and tributaries.These buffers were created with commands below, which show that the command st_buffer() requires at least two arguments: an input geometry and a distance, provided in the units of the CRS (in this case meters):

  1. seine_buff_5km = st_buffer(seine, dist = 5000)
  2. seine_buff_50km = st_buffer(seine, dist = 50000)

Buffers around the Seine dataset of 5 km (left) and 50 km (right). Note the colors, which reflect the fact that one buffer is created per geometry feature.
Figure 5.4: Buffers around the Seine dataset of 5 km (left) and 50 km (right). Note the colors, which reflect the fact that one buffer is created per geometry feature.

The third and final argument of st_buffer() is nQuadSegs, which means ‘number of segments per quadrant’ and is set by default to 30 (meaning circles created by buffers are composed of (4 \times 30 = 120) lines).This argument rarely needs to be set.Unusual cases where it may be useful include when the memory consumed by the output of a buffer operation is a major concern (in which case it should be reduced) or when very high precision is needed (in which case it should be increased).

5.2.4 Affine transformations

Affine transformation is any transformation that preserves lines and parallelism.However, angles or length are not necessarily preserved.Affine transformations include, among others, shifting (translation), scaling and rotation.Additionally, it is possible to use any combination of these.Affine transformations are an essential part of geocomputation.For example, shifting is needed for labels placement, scaling is used in non-contiguous area cartograms (see Section 8.6), and many affine transformations are applied when reprojecting or improving the geometry that was created based on a distorted or wrongly projected map.The sf package implements affine transformation for objects of classes sfg and sfc.

  1. nz_sfc = st_geometry(nz)

Shifting moves every point by the same distance in map units.It could be done by adding a numerical vector to a vector object.For example, the code below shifts all y-coordinates by 100,000 meters to the north, but leaves the x-coordinates untouched (left panel of Figure 5.5).

  1. nz_shift = nz_sfc + c(0, 100000)

Scaling enlarges or shrinks objects by a factor.It can be applied either globally or locally. Global scaling increases or decreases all coordinates values in relation to the origin coordinates, while keeping all geometries topological relations intact.It can be done by subtraction or multiplication of asfg or sfc object.

Local scaling treats geometries independently and requires points around which geometries are going to be scaled, e.g., centroids.In the example below, each geometry is shrunk by a factor of two around the centroids (middle panel in Figure 5.5).To achieve that, each object is firstly shifted in a way that its center has coordinates of 0, 0 ((nz_sfc - nz_centroid_sfc)).Next, the sizes of the geometries are reduced by half (* 0.5).Finally, each object’s centroid is moved back to the input data coordinates (+ nz_centroid_sfc).

  1. nz_centroid_sfc = st_centroid(nz_sfc)
  2. nz_scale = (nz_sfc - nz_centroid_sfc) * 0.5 + nz_centroid_sfc

Rotation of two-dimensional coordinates requires a rotation matrix:

[R =\begin{bmatrix}\cos \theta & -\sin \theta \ \sin \theta & \cos \theta \\end{bmatrix}]

It rotates points in a counterclockwise direction.The rotation matrix can be implemented in R as:

  1. rotation = function(a){
  2. r = a * pi / 180 #degrees to radians
  3. matrix(c(cos(r), sin(r), -sin(r), cos(r)), nrow = 2, ncol = 2)
  4. }

The rotation function accepts one argument a - a rotation angle in degrees.Rotation could be done around selected points, such as centroids (right panel of Figure 5.5).See vignette("sf3") for more examples.

  1. nz_rotate = (nz_sfc - nz_centroid_sfc) * rotation(30) + nz_centroid_sfc

Illustrations of affine transformations: shift, scale and rotate.
Figure 5.5: Illustrations of affine transformations: shift, scale and rotate.

Finally, the newly created geometries can replace the old ones with the st_set_geometry() function:

  1. nz_scale_sf = st_set_geometry(nz, nz_scale)

5.2.5 Clipping

Spatial clipping is a form of spatial subsetting that involves changes to the geometry columns of at least some of the affected features.

Clipping can only apply to features more complex than points:lines, polygons and their ‘multi’ equivalents.To illustrate the concept we will start with a simple example:two overlapping circles with a center point one unit away from each other and a radius of one (Figure 5.6).

  1. b = st_sfc(st_point(c(0, 1)), st_point(c(1, 1))) # create 2 points
  2. b = st_buffer(b, dist = 1) # convert points to circles
  3. plot(b)
  4. text(x = c(-0.5, 1.5), y = 1, labels = c("x", "y")) # add text

Overlapping circles.
Figure 5.6: Overlapping circles.

Imagine you want to select not one circle or the other, but the space covered by both x and y.This can be done using the function st_intersection(), illustrated using objects named x and y which represent the left- and right-hand circles (Figure 5.7).

  1. x = b[1]
  2. y = b[2]
  3. x_and_y = st_intersection(x, y)
  4. plot(b)
  5. plot(x_and_y, col = "lightgrey", add = TRUE) # color intersecting area

Overlapping circles with a gray color indicating intersection between them.
Figure 5.7: Overlapping circles with a gray color indicating intersection between them.

The subsequent code chunk demonstrates how this works for all combinations of the ‘Venn’ diagram representing x and y, inspired by Figure 5.1 of the book R for Data Science(Grolemund and Wickham 2016).

Spatial equivalents of logical operators.
Figure 5.8: Spatial equivalents of logical operators.

To illustrate the relationship between subsetting and clipping spatial data, we will subset points that cover the bounding box of the circles x and y in Figure 5.8.Some points will be inside just one circle, some will be inside both and some will be inside neither.stsample() is used below to generate a _simple random distribution of points within the extent of circles x and y, resulting in output illustrated in Figure 5.9.

  1. bb = st_bbox(st_union(x, y))
  2. box = st_as_sfc(bb)
  3. set.seed(2017)
  4. p = st_sample(x = box, size = 10)
  5. plot(box)
  6. plot(x, add = TRUE)
  7. plot(y, add = TRUE)
  8. plot(p, add = TRUE)
  9. text(x = c(-0.5, 1.5), y = 1, labels = c("x", "y"))

Randomly distributed points within the bounding box enclosing circles x and y.
Figure 5.9: Randomly distributed points within the bounding box enclosing circles x and y.

The logical operator way would find the points inside both x and y using a spatial predicate such as st_intersects(), whereas the intersection method simply finds the points inside the intersecting region created above as x_and_y.As demonstrated below the results are identical, but the method that uses the clipped polygon is more concise:

  1. sel_p_xy = st_intersects(p, x, sparse = FALSE)[, 1] &
  2. st_intersects(p, y, sparse = FALSE)[, 1]
  3. p_xy1 = p[sel_p_xy]
  4. p_xy2 = p[x_and_y]
  5. identical(p_xy1, p_xy2)
  6. #> [1] TRUE

5.2.6 Geometry unions

As we saw in Section 3.2.2, spatial aggregation can silently dissolve the geometries of touching polygons in the same group.This is demonstrated in the code chunk below in which 49 us_states are aggregated into 4 regions using base and tidyverse functions (see results in Figure 5.10):

  1. regions = aggregate(x = us_states[, "total_pop_15"], by = list(us_states$REGION),
  2. FUN = sum, na.rm = TRUE)
  3. regions2 = us_states %>% group_by(REGION) %>%
  4. summarize(pop = sum(total_pop_15, na.rm = TRUE))

Spatial aggregation on contiguous polygons, illustrated by aggregating the population of US states into regions, with population represented by color. Note the operation automatically dissolves boundaries between states.
Figure 5.10: Spatial aggregation on contiguous polygons, illustrated by aggregating the population of US states into regions, with population represented by color. Note the operation automatically dissolves boundaries between states.

What is going on in terms of the geometries?Behind the scenes, both aggregate() and summarize() combine the geometries and dissolve the boundaries between them using st_union().This is demonstrated in the code chunk below which creates a united western US:

  1. us_west = us_states[us_states$REGION == "West", ]
  2. us_west_union = st_union(us_west)

The function can take two geometries and unite them, as demonstrated in the code chunk below which creates a united western block incorporating Texas (challenge: reproduce and plot the result):

  1. texas = us_states[us_states$NAME == "Texas", ]
  2. texas_union = st_union(us_west_union, texas)

5.2.7 Type transformations

Geometry casting is a powerful operation that enables transformation of the geometry type.It is implemented in the st_cast function from the sf package.Importantly, st_cast behaves differently on single simple feature geometry (sfg) objects, simple feature geometry column (sfc) and simple features objects.

Let’s create a multipoint to illustrate how geometry casting works on simple feature geometry (sfg) objects:

  1. multipoint = st_multipoint(matrix(c(1, 3, 5, 1, 3, 1), ncol = 2))

In this case, st_cast can be useful to transform the new object into linestring or polygon (Figure 5.11):

  1. linestring = st_cast(multipoint, "LINESTRING")
  2. polyg = st_cast(multipoint, "POLYGON")

Examples of linestring and polygon casted from a multipoint geometry.
Figure 5.11: Examples of linestring and polygon casted from a multipoint geometry.

Conversion from multipoint to linestring is a common operation that creates a line object from ordered point observations, such as GPS measurements or geotagged media.This allows spatial operations such as the length of the path traveled.Conversion from multipoint or linestring to polygon is often used to calculate an area, for example from the set of GPS measurements taken around a lake or from the corners of a building lot.

The transformation process can be also reversed using st_cast:

  1. multipoint_2 = st_cast(linestring, "MULTIPOINT")
  2. multipoint_3 = st_cast(polyg, "MULTIPOINT")
  3. all.equal(multipoint, multipoint_2, multipoint_3)
  4. #> [1] TRUE

For single simple feature geometries (sfg), st_cast also provides geometry casting from non-multi-types to multi-types (e.g., POINT to MULTIPOINT) and from multi-types to non-multi-types.However, only the first element of the old object would remain in the second group of cases.

Geometry casting of simple features geometry column (sfc) and simple features objects works the same as for single geometries in most of the cases.One important difference is the conversion between multi-types to non-multi-types.As a result of this process, multi-objects are split into many non-multi-objects.

Table 5.1 shows possible geometry type transformations on simple feature objects.Each input simple feature object with only one element (first column) is transformed directly into another geometry type.Several of the transformations are not possible, for example, you cannot convert a single point into a multilinestring or a polygon (so the cells [1, 4:5] in the table are NA).On the other hand, some of the transformations are splitting the single element input object into a multi-element object.You can see that, for example, when you cast a multipoint consisting of five pairs of coordinates into a point.

Table 5.1: Geometry casting on simple feature geometries (see Section 2.1) with input type by row and output type by column. Values like (1) represent the number of features; NA means the operation is not possible.Abbreviations: POI, LIN, POL and GC refer to POINT, LINESTRING, POLYGON and GEOMETRYCOLLECTION. The MULTI version of these geometry types is indicated by a preceding M, e.g., MPOI is the acronym for MULTIPOINT.
POIMPOILINMLINPOLMPOLGC
POI(1)111NANANANA
MPOI(1)41111NANA
LIN(1)51111NANA
MLIN(1)7221NANANA
POL(1)511111NA
MPOL(1)101NA1211
GC(1)91NANANANA1

Let’s try to apply geometry type transformations on a new object, multilinestring_sf, as an example (on the left in Figure 5.12):

  1. multilinestring_list = list(matrix(c(1, 4, 5, 3), ncol = 2),
  2. matrix(c(4, 4, 4, 1), ncol = 2),
  3. matrix(c(2, 4, 2, 2), ncol = 2))
  4. multilinestring = st_multilinestring((multilinestring_list))
  5. multilinestring_sf = st_sf(geom = st_sfc(multilinestring))
  6. multilinestring_sf
  7. #> Simple feature collection with 1 feature and 0 fields
  8. #> geometry type: MULTILINESTRING
  9. #> dimension: XY
  10. #> bbox: xmin: 1 ymin: 1 xmax: 4 ymax: 5
  11. #> epsg (SRID): NA
  12. #> proj4string: NA
  13. #> geom
  14. #> 1 MULTILINESTRING ((1 5, 4 3)...

You can imagine it as a road or river network.The new object has only one row that defines all the lines.This restricts the number of operations that can be done, for example it prevents adding names to each line segment or calculating lengths of single lines.The st_cast function can be used in this situation, as it separates one mutlilinestring into three linestrings:

  1. linestring_sf2 = st_cast(multilinestring_sf, "LINESTRING")
  2. linestring_sf2
  3. #> Simple feature collection with 3 features and 0 fields
  4. #> geometry type: LINESTRING
  5. #> dimension: XY
  6. #> bbox: xmin: 1 ymin: 1 xmax: 4 ymax: 5
  7. #> epsg (SRID): NA
  8. #> proj4string: NA
  9. #> geom
  10. #> 1 LINESTRING (1 5, 4 3)
  11. #> 2 LINESTRING (4 4, 4 1)
  12. #> 3 LINESTRING (2 2, 4 2)

Examples of type casting between MULTILINESTRING (left) and LINESTRING (right).
Figure 5.12: Examples of type casting between MULTILINESTRING (left) and LINESTRING (right).

The newly created object allows for attributes creation (see more in Section 3.2.4) and length measurements:

  1. linestring_sf2$name = c("Riddle Rd", "Marshall Ave", "Foulke St")
  2. linestring_sf2$length = st_length(linestring_sf2)
  3. linestring_sf2
  4. #> Simple feature collection with 3 features and 2 fields
  5. #> geometry type: LINESTRING
  6. #> dimension: XY
  7. #> bbox: xmin: 1 ymin: 1 xmax: 4 ymax: 5
  8. #> epsg (SRID): NA
  9. #> proj4string: NA
  10. #> geom name length
  11. #> 1 LINESTRING (1 5, 4 3) Riddle Rd 3.61
  12. #> 2 LINESTRING (4 4, 4 1) Marshall Ave 3.00
  13. #> 3 LINESTRING (2 2, 4 2) Foulke St 2.00