13.3 Tidy the input data

The German government provides gridded census data at either 1 km or 100 m resolution. The following code chunk downloads, unzips and reads in the 1 km data.

  download.file("https://tinyurl.com/ybtpkwxz",
                destfile = "census.zip", mode = "wb")
  unzip("census.zip") # unzip the files
  census_de = readr::read_csv2(list.files(pattern = "Gitter.csv"))
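
read_csv2() is the readr variant for files that use semicolons as field separators and commas as decimal marks, a convention common in German-language data. The sketch below illustrates this behavior on a small made-up file; the file contents and the object toy are assumptions for illustration and are not part of the census data.

```r
library(readr)

# write a tiny semicolon-separated file with decimal commas (made-up values)
tmp = tempfile(fileext = ".csv")
writeLines(c("x;y;share", "1;2;0,5", "3;4;1,25"), tmp)

# read_csv2() splits fields on ";" and parses "0,5" as the number 0.5
toy = suppressMessages(read_csv2(tmp))
toy
```

Reading the same file with read_csv() would instead split on the commas, which is why the choice of reader matters here.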

As a convenience to the reader, the corresponding data has been put into spDataLarge and can be accessed as follows:

  data("census_de", package = "spDataLarge")

The census_de object is a data frame containing 13 variables for more than 300,000 grid cells across Germany. For our work, we only need a subset of these: Easting (x) and Northing (y), number of inhabitants (population; pop), mean age (mean_age), proportion of women (women) and average household size (hh_size). These variables are selected and renamed from German into English in the code chunk below and summarized in Table 13.1. Further, mutate_all() is used to convert the values -1 and -9 (meaning unknown) to NA.

  # pop = population, hh_size = household size
  input = dplyr::select(census_de, x = x_mp_1km, y = y_mp_1km, pop = Einwohner,
                        women = Frauen_A, mean_age = Alter_D,
                        hh_size = HHGroesse_D)
  # set -1 and -9 to NA
  input_tidy = dplyr::mutate_all(input, list(~ifelse(. %in% c(-1, -9), NA, .)))
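
In recent dplyr versions (1.0.0 and later), mutate_all() is superseded by across(). As a minimal sketch, the same recoding can be written with across() on a made-up data frame; the object toy and its values are assumptions for illustration, not census data.

```r
library(dplyr)

# hypothetical toy data mimicking the structure of `input`
toy = tibble(
  x = c(4031500, 4032500, 4033500),
  pop = c(3, -1, 2),
  women = c(-9, 2, 3)
)

# replace the codes -1 and -9 (unknown) with NA in every column
toy_tidy = mutate(toy, across(everything(), ~ ifelse(.x %in% c(-1, -9), NA, .x)))
toy_tidy
```

Column names are preserved, and only the cells holding -1 or -9 become NA.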

Table 13.1: Categories for each variable in census data from Datensatzbeschreibung…xlsx located in the downloaded file census.zip (see Figure 13.1 for their spatial distribution).

  class  Population  % female  Mean age  Household size
  1      3-250       0-40      0-40      1-2
  2      250-500     40-47     40-42     2-2.5
  3      500-2000    47-53     42-44     2.5-3
  4      2000-4000   53-60     44-47     3-3.5
  5      4000-8000   >60       >47       >3.5
  6      >8000
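
Assuming the columns hold the class codes listed in Table 13.1 rather than raw values, it can help to attach the corresponding ranges as labels when inspecting the data. The following sketch does this for the population classes with base R's factor(); the vector pop_class is made up for illustration and the object names are not from the original code.

```r
# population ranges per class, copied from Table 13.1
pop_labels = c("3-250", "250-500", "500-2000", "2000-4000", "4000-8000", ">8000")

# hypothetical class codes, as they might appear in the pop column
pop_class = c(1, 6, 3, NA, 2)

# map each class code to its range; NA stays NA
pop_range = factor(pop_class, levels = 1:6, labels = pop_labels)
table(pop_range, useNA = "ifany")
```

The labels are only a reading aid: the underlying class codes remain unchanged in pop_class.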