7.2 Retrieving open data

A vast and ever-increasing amount of geographic data is available on the internet, much of which is free to access and use (with appropriate credit given to its providers). In some ways there is now too much data, in the sense that there are often multiple places to access the same dataset. Some datasets are of poor quality. In this context, it is vital to know where to look, so the first section covers some of the most important sources. Various ‘geoportals’ (web services providing geospatial datasets such as Data.gov) are a good place to start, providing a wide range of data but often only for specific locations (as illustrated in the updated Wikipedia page on the topic).

Some global geoportals overcome this issue. The GEOSS portal and the Copernicus Open Access Hub, for example, contain many raster datasets with global coverage. A wealth of vector datasets, with global and regional coverage, can be accessed from the SEDAC portal of the National Aeronautics and Space Administration (NASA) and from the European Union’s INSPIRE geoportal.

Most geoportals provide a graphical interface allowing datasets to be queried based on characteristics such as spatial and temporal extent, the United States Geological Survey’s EarthExplorer being a prime example. Exploring datasets interactively in a browser is an effective way of understanding available layers. Downloading data is best done with code, however, from reproducibility and efficiency perspectives. Downloads can be initiated from the command line using a variety of techniques, primarily via URLs and APIs (see the Sentinel API for example). Files hosted on static URLs can be downloaded with download.file(), as illustrated in the code chunk below which accesses US National Parks data from: catalog.data.gov/dataset/national-parks:

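  library(sf)
  # Download the zipped shapefile of US National Parks boundaries
  download.file(url = "http://nrdata.nps.gov/programs/lands/nps_boundary.zip",
                destfile = "nps_boundary.zip")
  # Extract the archive into the current working directory
  unzip(zipfile = "nps_boundary.zip")
  # Read the extracted shapefile as an sf object
  usa_parks = st_read(dsn = "nps_boundary.shp")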
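
Because reproducible scripts tend to be run many times, it can be worth skipping the download when the file is already on disk. The following is a minimal sketch of that idea, not an established function: download_once() is a hypothetical helper name introduced here for illustration, and it assumes that an existing local file of the same name is a complete copy of the remote one.

```r
# Hypothetical helper (not part of base R or sf): download a file only if
# it is not already present locally, and return the local path.
download_once = function(url, destfile = basename(url)) {
  if (!file.exists(destfile)) {
    # Binary mode keeps archives such as .zip files intact on Windows
    download.file(url = url, destfile = destfile, mode = "wb")
  }
  destfile
}

# Example use (requires network access, so left commented out):
# zip_path = download_once("http://nrdata.nps.gov/programs/lands/nps_boundary.zip")
# unzip(zipfile = zip_path)
```

The check relies only on file.exists(), so a partially downloaded or outdated file would be reused as-is; deleting the local copy forces a fresh download.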