10.3 Geometric algorithms

Algorithms can be understood as the computing equivalent of a cooking recipe.They are a complete set of instructions which, when undertaken on the input (ingredients), result in useful (tasty) outputs.Before diving into a concrete case study, a brief history will show how algorithms relate to scripts (covered in Section 10.2) and functions (which can be used to generalize algorithms, as we’ll see in Section 10.4).

The word “algorithm” originated in 9th century Baghdad with the publication of Hisab al-jabr w’al-muqabala, an early math textbook.The book was translated into Latin and became so popular that the author’s last name, al-Khwārizmī, “was immortalized as a scientific term: Al-Khwarizmibecame Alchoarismi, Algorismi and, eventually, algorithm” (Bellos 2011).In the computing age, algorithm refers to a series of steps that solves a problem, resulting in a pre-defined output.Inputs must be formally defined in a suitable data structure (Wise 2001).Algorithms often start as flow charts or pseudocode showing the aim of the process before being implemented in code.To ease usability, common algorithms are often packaged inside functions, which may hide some or all of the steps taken (unless you look at the function’s source code, see Section 10.4).

Geoalgorithms, such as those we encountered in Chapter 9, are algorithms that take geographic data in and, generally, return geographic results (alternative terms for the same thing include GIS algorithms and geometric algorithms).That may sound simple but it is a deep subject with an entire academic field, Computational Geometry, dedicated to their study (Berg et al. 2008) and numerous books on the subject.O’Rourke (1998), for example, introduces the subject with a range of progressively harder geometric algorithms using reproducible and freely available C code.

An example of a geometric algorithm is one that finds the centroid of a polygon.There are many approaches to centroid calculation, some of which work only on specific types of spatial data.For the purposes of this section, we choose an approach that is easy to visualize: breaking the polygon into many triangles and finding the centroid of each of these, an approach discussed by Kaiser and Morin (1993) alongside other centroid algorithms (and mentioned briefly in O’Rourke 1998).It helps to further break down this approach into discrete tasks before writing any code (subsequently referred to as step 1 to step 4, these could also be presented as a schematic diagram or pseudocode):

  • Divide the polygon into contiguous triangles.
  • Find the centroid of each triangle.
  • Find the area of each triangle.
  • Find the area-weighted mean of triangle centroids.
    These steps may sound straightforward, but converting words into working code requires some work and plenty of trial-and-error, even when the inputs are constrained:The algorithm will only work for convex polygons, which contain no internal angles greater than 180°, no star shapes allowed (packages decido and sfdct can triangulate non-convex polygons using external libraries, as shown in the algorithm vignette at geocompr.github.io).

The simplest data structure of a polygon is a matrix of x and y coordinates in which each row represents a vertex tracing the polygon’s border in order where the first and last rows are identical (Wise 2001).In this case, we’ll create a polygon with five vertices in base R, building on an example from GIS Algorithms(Xiao 2016 see github.com/gisalgs for Python code), as illustrated in Figure 10.2:

  1. # generate a simple matrix representation of a polygon:
  2. x_coords = c(10, 0, 0, 12, 20, 10)
  3. y_coords = c(0, 0, 10, 20, 15, 0)
  4. poly_mat = cbind(x_coords, y_coords)

Now that we have an example dataset, we are ready to undertake step 1 outlined above.The code below shows how this can be done by creating a single triangle (T1), that demonstrates the method; it also demonstrates step 2 by calculating its centroid based on the formula(1/3(a + b + c)) where (a) to (c) are coordinates representing the triangle’s vertices:

  1. # create a point representing the origin:
  2. Origin = poly_mat[1, ]
  3. # create 'triangle matrix':
  4. T1 = rbind(Origin, poly_mat[2:3, ], Origin)
  5. # find centroid (drop = FALSE preserves classes, resulting in a matrix):
  6. C1 = (T1[1, , drop = FALSE] + T1[2, , drop = FALSE] + T1[3, , drop = FALSE]) / 3

Illustration of polygon centroid calculation problem.
Figure 10.2: Illustration of polygon centroid calculation problem.

Step 3 is to find the area of each triangle, so a weighted mean accounting for the disproportionate impact of large triangles is accounted for.The formula to calculate the area of a triangle is as follows (Kaiser and Morin 1993):

[\frac{Ax ( B y − C y ) + B x ( C y − A y ) + C x ( A y − B y )}{ 2 }]

Where (A) to (C) are the triangle’s three points and (x) and (y) refer to the x and y dimensions.A translation of this formula into R code that works with the data in the matrix representation of a triangle T1 is as follows (the function abs() ensures a positive result):

  1. # calculate the area of the triangle represented by matrix T1:
  2. abs(T1[1, 1] * (T1[2, 2] - T1[3, 2]) +
  3. T1[2, 1] * (T1[3, 2] - T1[1, 2]) +
  4. T1[3, 1] * (T1[1, 2] - T1[2, 2]) ) / 2
  5. #> [1] 50

This code chunk outputs the correct result.55The problem is that code is clunky and must by re-typed if we want to run it on another triangle matrix.To make the code more generalizable, we will see how it can be converted into a function in Section 10.4.

Step 4 requires steps 2 and 3 to be undertaken not just on one triangle (as demonstrated above) but on all triangles.This requires iteration to create all triangles representing the polygon, illustrated in Figure 10.3.lapply() and vapply() are used to iterate over each triangle here because they provide a concise solution in base R:56

  1. i = 2:(nrow(poly_mat) - 2)
  2. T_all = lapply(i, function(x) {
  3. rbind(Origin, poly_mat[x:(x + 1), ], Origin)
  4. })
  5. C_list = lapply(T_all, function(x) (x[1, ] + x[2, ] + x[3, ]) / 3)
  6. C = do.call(rbind, C_list)
  7. A = vapply(T_all, function(x) {
  8. abs(x[1, 1] * (x[2, 2] - x[3, 2]) +
  9. x[2, 1] * (x[3, 2] - x[1, 2]) +
  10. x[3, 1] * (x[1, 2] - x[2, 2]) ) / 2
  11. }, FUN.VALUE = double(1))

Illustration of iterative centroid algorithm with triangles. The X represents the area-weighted centroid in iterations 2 and 3.
Figure 10.3: Illustration of iterative centroid algorithm with triangles. The X represents the area-weighted centroid in iterations 2 and 3.

We are now in a position to complete step 4 to calculate the total area with sum(A) and the centroid coordinates of the polygon with weighted.mean(C[, 1], A) and weighted.mean(C[, 2], A) (exercise for alert readers: verify these commands work).To demonstrate the link between algorithms and scripts, the contents of this section have been condensed into 10-centroid-alg.R.We saw at the end of Section 10.2 how this script can calculate the centroid of a square.The great thing about scripting the algorithm is that it works on the new poly_mat object (see exercises below to verify these results with reference to st_centroid()):

  1. source("code/10-centroid-alg.R")
  2. #> [1] "The area is: 245"
  3. #> [1] "The coordinates of the centroid are: 8.83, 9.22"

The example above shows that low-level geographic operations can be developed from first principles with base R.It also shows that if a tried-and-tested solution already exists, it may not be worth re-inventing the wheel:if we aimed only to find the centroid of a polygon, it would have been quicker to represent poly_mat as an sf object and use the pre-existing sf::st_centroid() function instead.However, the great benefit of writing algorithms from 1st principles is that you will understand every step of the process, something that cannot be guaranteed when using other peoples’ code.A further consideration is performance: R is slow compared with low-level languages such as C++ for number crunching (see Section 1.3) and optimization is difficult.If the aim is to develop new methods, computational efficiency should not be prioritized.This is captured in the saying “premature optimization is the root of all evil (or at least most of it) in programming” (Knuth 1974).

Algorithm development is hard.This should be apparent from the amount of work that has gone into developing a centroid algorithm in base R that is just one, rather inefficient, approach to the problem with limited real-world applications (convex polygons are uncommon in practice).The experience should lead to an appreciation of low-level geographic libraries such as GEOS (which underlies sf::st_centroid()) and CGAL (the Computational Geometry Algorithms Library) which not only run fast but work on a wide range of input geometry types.A great advantage of the open source nature of such libraries is that their source code is readily available for study, comprehension and (for those with the skills and confidence) modification.57