Chapter 8 Large data sets
This chapter describes how large spatial and spatiotemporal datasets
can be handled with R, with a focus on packages
For practical use, we classify large data sets as either
- too large to fit in working memory, or
- also too large to fit on the local hard drive, or
- also too large to download it to own compute infrastructure
these three categories correspond very roughly to Gigabyte-, Terabyte- and Petabyte-sized data sets.
8.1 Vector data:
8.1.1 Reading from disk
st_read reads vector data from disk, using GDAL, and
then keeps the data read in working memory. In case the file is
too large to be read in working memory, several options exist to
read parts of the file. The first is to set argument
with a WKT text string containing a geometry; only geometries from
the target file that intersect with this geometry will be returned.
An example is
#  "/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg"
# Reading layer `nc.gpkg' from data source # `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg' # using driver `GPKG' # Simple feature collection with 8 features and 14 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -81.9 ymin: 36 xmax: -80 ymax: 36.6 # Geodetic CRS: NAD27
The second option is to use the
query argument to
which can be any query in “OGR SQL” dialect, which can be used to
select features from a layer, and limit fields. An example is:
# Reading layer `nc.gpkg' from data source # `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg' # using driver `GPKG' # Simple feature collection with 61 features and 2 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -83.3 ymin: 33.9 xmax: -76.1 ymax: 36.6 # Geodetic CRS: NAD27
nc.pgkg is the layer name, which can be obtained from
Sequences of records can be read using
OFFSET, to read records 51-60 use
# Reading layer `nc.gpkg' from data source # `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg' # using driver `GPKG' # Simple feature collection with 10 features and 2 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84 ymin: 35.2 xmax: -75.5 ymax: 36.2 # Geodetic CRS: NAD27
Further query options include selection on geometry type, polygon area. When the dataset queried is a spatial database, then the query is passed on to the database and not interpreted by GDAL; this means that more powerful features will be available. Further information is found in the GDAL documentation under “OGR SQL dialect”.
Very large file files or directories that are zipped can be read
without the need to unzip them, using the
/vsizip (for zip),
/vsigzip (for gzip) or
/vsitar (for tar files) prefix to files;
this is followed by the path to the zip file, and then followed by
the file inside this zip file. Reading files this way may come at
some computatational cost.
8.1.2 Reading from databases, dbplyr
Although GDAL has support for several spatial databases, and as
mentioned above it passes on SQL in the
query argument to the
database, it is sometimes beneficial to directly read and
to a spatial database using the R database drivers for this. An
example of this is:
# Simple feature collection with 3 features and 1 field # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -81.7 ymin: 36.2 xmax: -80.4 ymax: 36.6 # Geodetic CRS: NAD27 # bir74 wkb_geometry # 1 1091 MULTIPOLYGON (((-81.5 36.2,... # 2 487 MULTIPOLYGON (((-81.2 36.4,... # 3 3188 MULTIPOLYGON (((-80.5 36.2,...
A spatial query might look like
# Simple feature collection with 1 feature and 1 field # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -81.7 ymin: 36.2 xmax: -81.2 ymax: 36.6 # Geodetic CRS: NAD27 # bir74 wkb_geometry # 1 1091 MULTIPOLYGON (((-81.5 36.2,...
Here, the intersection is done in the database, and uses the spatial index typically present.
The same mechanism works when using
dplyr with a database backend:
Spatial queries can be formulated and are passed on to the database:
# # A tibble: 1 x 16 # ogc_fid area perimeter cnty_ cnty_id name fips fipsno cress_id bir74 sid74 # <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> # 1 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 # # … with 5 more variables: nwbir74 <dbl>, bir79 <dbl>, sid79 <dbl>, # # nwbir79 <dbl>, wkb_geometry <pq_gmtry>
# # Source: lazy query [?? x 16] # # Database: postgres [edzer@localhost:5432/postgis] # ogc_fid area perimeter cnty_ cnty_id name fips fipsno cress_id bir74 sid74 # <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> # 1 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 # 2 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 # 3 5 0.153 2.21 1832 1832 North… 37131 37131 66 1421 9 # # … with 5 more variables: nwbir74 <dbl>, bir79 <dbl>, sid79 <dbl>, # # nwbir79 <dbl>, wkb_geometry <pq_gmtry>
It should be noted that PostGIS’
ST_Area computes the same
area as the
AREA field in
nc, which is the meaningless value
obtained by assuming the coordinates are projected, although they
8.1.3 Reading from online resources or web services
GDAL drivers support reading from online recourses, by prepending
/vsicurl/ before the URL starting with e.g.
https://. A number of
similar drivers specialized for particular clouds include
for Amazon S3,
/vsigs for Google Cloud Storage,
/vsioss for Alibaba Cloud, or
/vsiswift for OpenStack
Swift Object Storage. These prepositions can be combined e.g. with
/vsizip/ to read a zipped online resource. Depending on the
file format used, reading information this way may always involve
reading the entire file, or reading it multiple times, and may not
always be the most efficient way of handling resources. A format
like “cloud-optimized geotiff” (COG) has been specially designed
to be efficient and resource-friendly in many cases, e.g. for only
reading the metadata, or for only reading overviews (low-resolutions
versions of the full imagery) or spatial segments. COGs can also
be created using the GeoTIFF driver of GDAL, and setting the right
dataset creation options in a
8.1.4 API’s, OpenStreetMap
Although online resource do not have to be stored files but could be created server-side on the fly, typical web services for geospatial data create data on the fly, and give access to this through an API. As an example, data from OpenStreetMap can be bulk downloaded and read locally, e.g. using the GDAL vector driver, but more typical a user wants to obtain a small subset of the data or use the data for a small query. Several R pacages exist that query openstreetmap data:
OpenStreetMapdownloads data as raster tiles, typically used as backdrop or reference for plotting other features
osmdatadownloads vector data as points, lines or polygons in
osmarreturns vector data, but in addition the network topology (as an
igraphobject) that contains how road elements form a network, and has functions that compute the shortest route.
When provided with a correctly formulated API call in the URL the
highly configurable GDAL OSM driver (in
st_read) can read an
“.osm” file (xml) and returns a dataset with five layers:
that have significant tags,
lines with non-area “way” features,
multilinestrings with “relation” features,
“relation” features and
other_relations. A simple and very small
bounding box query to OpenStreetMap could look like
and from this file we can read the layer
lines, and plot its
first attribute by
o = read_sf("data/ms.osm", "lines") p = read_sf("data/ms.osm", "multipolygons") bb = st_bbox(c(xmin=7.595, ymin = 51.969, xmax = 7.598, ymax = 51.970), crs = 4326) plot(st_as_sfc(bb), axes = TRUE, lwd = 2, lty = 2, cex.axis = .5) plot(o[,1], lwd = 2, add = TRUE) plot(st_geometry(p), border = NA, col = '#88888888', add = TRUE)
the result of which is shown in figure 8.1. The overpass API provides a more generic and powerful query functionality to OpenStreetMap data.
8.2 Raster data:
A common challenge with raster datasets is not only that they come in large files (single Sentinel-2 tiles are around 1 Gb), but that many of these files, potentially thousands, are needed to address the area and time period of interest. At time of writing this, the Copernicus program that runs all Sentinel satellites publishes 160 Tb of images per day. This means that a classic pattern in using R, consisting of
- downloading data to local disc,
- loading the data in memory,
- analysing it
is not going to work.
Cloud-based Earth Observation processing platforms like Google Earth Engine (Gorelick et al. 2017) or Sentinel Hub recognize this and let users work with datasets up to the petabyte range rather easily and with a great deal of interactivity. They share the following properties:
- computations are posponed as long as possible (lazy evaluation),
- only the data you ask for are being computed and returned, and nothing more,
- storing intermediate results is avoided in favour of on-the-fly computations,
- maps with useful results are generated and shown quickly to allow for interactive model development.
This is similar to the
dbplyr interface to databases and
cloud-based analytics environments, but differs in the aspect of
what we want to see quickly: rather than the first \(n\) records
dbplyr table, we want a quick overview of the results,
in the form of a map covering the whole area, or part of it, but
at screen resolution rather than native (observation) resolution.
If for instance we want to “see” results for the United States on screen with 1000 x 1000 pixels, we only need to compute results for this many pixels, which corresponds roughly to data on a grid with 3000 m x 3000 m grid cells. For Sentinel-2 data with 10 m resolution, this means we can subsample with a factor 300, giving 3 km x 3 km resolution. Processing, storage and network requirements then drop a factor \(300^2 \approx 10^5\), compared to working on the native 10 m x 10 m resolution. On the platforms mentioned, zooming in the map triggers further computations on a finer resolution and smaller extent.
A simple optimisation that follows these lines is how stars’ plot method works: in case of plotting large rasters, it subsamples the array before it plots, drastically saving time. The degree of subsampling is derived from the plotting region size and the plotting resolution (pixel density). For vector devices, such as pdf, R sets plot resolution to 75 dpi, corresponding to 0.3 mm per pixel. Enlarging plots may reveal this, but replotting to an enlarged devices will create a plot at target density.
stars proxy objects
To handle datasets that are too large to fit in memory,
stars_proxy objects. To demonstrate its use, we will
starsdata package, an R data package with larger datasets
(around 1 Gb total). It can be installed by
We can “load” a Sentinel-2 image from it by
#  7.69e+08
# stars_proxy object with 1 attribute in 1 file(s): # $`MTD_MSIL1C.xml:10m:EPSG_32632` #  "[...]/MTD_MSIL1C.xml:10m:EPSG_32632" # # dimension(s): # from to offset delta refsys point values x/y # x 1 10980 3e+05 10 WGS 84 / UTM z... NA NULL [x] # y 1 10980 6e+06 -10 WGS 84 / UTM z... NA NULL [y] # band 1 4 NA NA NA NA B4,...,B8
# 11160 bytes
and we see that this does not actually load any of the pixel
values, but keeps the reference to the dataset and fills the
dimensions table. (The convoluted
s2 name is needed to point
GDAL to the right file inside the
.zip file containing 115 files
The idea of a proxy object is that we can build expressions like
but that the computations for this are postponed. Only when we
really need the data, e.g. because we want to plot it, is
p * 2 evaluated. We need data when either
- we want to
- we want to write an object to disk, with
- we want to explicitly load an object in memory, with
In case the entire object does not fit in memory,
write_stars choose different strategies to deal with this:
plotfetches only the pixels that can be seen, rather than all pixels available, and
write_starsreads, processes, and writes data chunk by chunk.
Downsampling and chunking is implemented for spatially dense images, not e.g. for dense time series, or other dense dimensions.
As an example, the output of
plot(p), shown in figure 8.2
only fetches the pixels that can be seen on the plot device, rather than the 10980 x 10980 pixels available in each band. The downsampling ratio taken is
#  19
meaning that for every 19 \(\times\) 19 sub-image in the original image, only one pixel is read, and plotted. This value is still a bit too low as it ignores the white space and space for the key on the plotting device.
8.2.2 Operations on proxy objects
A few dedicated methods are available for
#  [ [[<- [<- adrop aggregate #  aperm as.data.frame c coerce dim #  droplevels filter initialize is.na mapView #  Math merge mutate Ops plot #  predict print pull replace_na select #  show slice slotsFromS3 split st_apply #  st_as_sf st_as_stars st_crop st_mosaic st_redimension #  st_sample st_set_bbox transmute write_stars # see '?methods' for accessing help and source code
We have seen
dim reads out
the dimension from the dimensions metadata table.
The three methods that actually fetch data are
st_as_stars reads the actual data into a
stars object, its argument
downsample controls the downsampling
plot does this too, choosing an appropriate
value from the device resolution, and plots the object.
star_proxy object to disc.
All other methods for
stars_proxy objects do not actually operate
on the raster data but add the operations to a to do list,
attached to the object. Only when actual raster data are fetched,
e.g. by calling
st_as_stars, the commands in this list
st_crop limits the extent (area) of the raster that will be
stars_proxy objects, but still doesn’t read
adrop drops empty dimensions,
aperm changes dimension
write_stars reads and processes its input chunk-wise; it has an
chunk_size that lets users control the size of spatial
8.3 Very large data cubes
At some stage, data sets need to be analysed that are so large that downloading them is no longer feasible; even when local storage would be sufficient, network bandwidth may become limiting. Examples are satellite image archives such as those from Landsat and Copernicus (Sentinel-x), or model computations such as the ERA5 (Hersbach et al. 2020), a model reanalysis of the global atmosphere, land surface and ocean waves from 1950 onwards. In such cases it may be most helpful to either gain access to (typically: rent) virtual machines in a cloud where these data are available and nearby (i.e., the data should be stored in the same data center as where the virtual machine is), or to use a system that lets the user carry out such computations without having to worry about virtual machines. Both options will be discussed.
8.3.1 Finding and processing assets
When working on a virtual machine on a cloud, a first task is usally to find the assets (files) to work on. It looks attractive to obtain a file listing, and then parse file names such as
for their metadata including the date of acquisition and the code of the spatial tile covered. Obtaining such a file listing however is usually computationally very demanding, as is the processing of the result, when the number of tiles runs in the many millions.
A solution to this is to use a catalogue. The recently developed
and increasingly deployed STAC, short for spatiotemporal asset
catalogue, provides an API that can be used to query image
collections by properties like bounding box, date, band, and cloud
coverage. The R package
rstac (Brazil Data Cube Team 2021) provides an R interface
to create queries, and manage the information returned.
Processing the resulting files may involve creating a data cube at a lower spatial and/or temporal resolution, from images that may span a range of coordinate reference systems (e.g., several UTM zones). An R package that can do that is gdalcubes (Appel 2020; Appel and Pebesma 2019).
8.3.2 Processing data: GEE, openEO
Platforms that do not require the management and programming of virtual machines in the cloud but provide direct access to the imagery managed include GEE, openEO, and the climate data store.
Google Earth Engine (GEE) is a cloud platform that allows users
to compute on large amounts of Earth Observation data as well
as modelling products (Gorelick et al. 2017). It has powerful analysis
capabilities, including most of the data cube operations explained
in section 6.3. It has an IDE where scripts can
functionality. The code of GEE is not open source, and cannot be
extended by arbitrary user-defined functions in languages like
Python or R. R package
rgee (Aybar 2021) provides an R client
interface to GEE.
Cloud-based data cube processing platforms built entirely around
open source software are arising, several of which using the openEO
API (Schramm et al. 2021). This API has room for user-defined functions (UDFs)
written in Python or R that are being passed on through the API and
executed at the pixel level, e.g. to aggregate or reduce dimensions.
UDFs in R represent the data chunk to be processed as
object, in Python
xarray objects are used.
Other platforms include the Copernicus climate data store (Raoult et al. 2017)
or atmosphere data store, which allow processing of atmospheric
or climate data from ECMWF, including ERA5. An R package with an
interface to both data stores is
ecmwfr (Hufkens 2020).
Appel, Marius. 2020. Gdalcubes: Earth Observation Data Cubes from Satellite Image Collections. https://github.com/appelmar/gdalcubes_R.
Appel, Marius, and Edzer Pebesma. 2019. “On-Demand Processing of Data Cubes from Satellite Image Collections with the Gdalcubes Library.” Data 4 (3): 92. https://www.mdpi.com/2306-5729/4/3/92.
Aybar, Cesar. 2021. Rgee: R Bindings for Calling the Earth Engine Api. https://CRAN.R-project.org/package=rgee.
Brazil Data Cube Team. 2021. Rstac: Client Library for Spatiotemporal Asset Catalog. https://github.com/brazil-data-cube/rstac.
Gorelick, Noel, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. 2017. “Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone.” Remote Sensing of Environment 202: 18–27. https://doi.org/10.1016/j.rse.2017.06.031.
Hersbach, Hans, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, et al. 2020. “The Era5 Global Reanalysis.” Quarterly Journal of the Royal Meteorological Society 146 (730): 1999–2049. https://doi.org/https://doi.org/10.1002/qj.3803.
Hufkens, Koen. 2020. Ecmwfr: Interface to Ecmwf and Cds Data Web Services. https://github.com/bluegreen-labs/ecmwfr.
Raoult, Baudouin, Cedric Bergeron, Angel López Alós, Jean-Noël Thépaut, and Dick Dee. 2017. “Climate Service Develops User-Friendly Data Store.” ECMWF Newsletter 151: 22–27.
Schramm, Matthias, Edzer Pebesma, Milutin Milenković, Luca Foresta, Jeroen Dries, Alexander Jacob, Wolfgang Wagner, et al. 2021. “The openEO Api–Harmonising the Use of Earth Observation Cloud Services Using Virtual Data Cube Functionalities.” Remote Sensing 13 (6). https://doi.org/10.3390/rs13061125.