# Chapter 7 Introduction to sf and stars

This chapter introduces R packages `sf`

and `stars`

. `sf`

provides
a table format for simple features, where feature geometries are
carried in a list-column. R package `stars`

was written to support
raster and vector datacubes (Chapter 6), and has
raster data stacks and feature time series as special cases. `sf`

first appeared on CRAN in 2016, `stars`

in 2018. Development of both
packages received support from the R Consortium as well as strong
community engagement. The packages were designed to work together.

All functions operating on `sf`

or `stars`

objects start with `st_`

,
making it easy to recognize them or to search for them when using
command line completion.

## 7.1 Package `sf`

Intended to succeed and replace R packages `sp`

, `rgeos`

and the
vector parts of `rgdal`

, R Package `sf`

(Pebesma 2018) was developed to
move spatial data analysis in R closer to standards-based approaches
seen in the industry and open source projects, to build upon more
modern versions of the open source geospatial software stack (figure
1.6), and to allow for integration of R
spatial software with the tidyverse if desired.

To do so, R package `sf`

provides simple features access
(J. Herring and others 2011), natively, to R. It provides an interface
to several `tidyverse`

packages, in particular to `ggplot2`

,
`dplyr`

and `tidyr`

. It can read and write data through GDAL, execute
geometrical operations using GEOS (for projected coordinates)
or s2geometry (for ellipsoidal coordinates), and carry out
coordinate transformations or conversions using PROJ. External C++
libraries are interfaced using `Rcpp`

(Eddelbuettel 2013).

Package `sf`

represents sets of simple features in `sf`

objects,
a sub-class of a `data.frame`

or tibble. `sf`

objects contain at
least one *geometry list-column* of class `sfc`

, which for each
element contains the geometry as an R object of class `sfg`

.
A geometry list-column acts as a variable in a `data.frame`

or tibble, but has a more complex structure than e.g. numeric or
character variables. Following the convention of `PostGIS`

, all
operations (functions, method) that operate on `sf`

objects or
related start with `st_`

.

An `sf`

object has the following meta-data:

- the name of the (active) geometry column, held in attribute
`sf_column`

- for each non-geometry variable, the attribute-geometry
relationships (section 5.1), held in attribute
`agr`

An `sfc`

geometry list-column has the following meta-data:

- the coordinate reference system held in attribute
`crs`

- the bounding box held in attribute
`bbox`

- the precision held in attribute
`precision`

- the number of empty geometries held in attribute
`n_empty`

These attributes may best be accessed or set by using
functions like `st_bbox`

, `st_crs`

, `st_set_crs`

, `st_agr`

,
`st_set_agr`

, `st_precision`

, and `st_set_precision`

.

### 7.1.1 Creation

One could create an `sf`

object from scratch e.g. by

```
library(sf)
p1 = st_point(c(7.35, 52.42))
p2 = st_point(c(7.22, 52.18))
p3 = st_point(c(7.44, 52.19))
sfc = st_sfc(list(p1, p2, p3), crs = 'OGC:CRS84')
st_sf(elev = c(33.2, 52.1, 81.2), marker = c("Id01", "Id02", "Id03"),
geom = sfc)
```

```
# Simple feature collection with 3 features and 2 fields
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 7.22 ymin: 52.2 xmax: 7.44 ymax: 52.4
# Geodetic CRS: WGS 84
# elev marker geom
# 1 33.2 Id01 POINT (7.35 52.4)
# 2 52.1 Id02 POINT (7.22 52.2)
# 3 81.2 Id03 POINT (7.44 52.2)
```

(ref:sfcomp) components of an `sf`

object
Figure 7.1 gives an explanation of the components printed. Rather than creating objects from scratch, spatial data in R are typically read from an external source, which can be

- an external file,
- a request to a web service, or
- a dataset held in some form in another R package.

The next section introduces reading from files; section 8.1 discusses handling of datasets too large to fit into working memory.

### 7.1.2 Reading and writing

Reading datasets from an exteral “data source” (file, web service,
or even string) is done using `st_read`

:

`# [1] "/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg"`

```
# Reading layer `nc.gpkg' from data source
# `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg'
# using driver `GPKG'
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.3 ymin: 33.9 xmax: -75.5 ymax: 36.6
# Geodetic CRS: NAD27
```

Here, the file name and path `file`

is read from the `sf`

package,
which has a different path on every machine, and hence is guaranteed
to be present on every `sf`

installation.

Command `st_read`

has two arguments: the *data set name* (`dsn`

)
and the *layer*. In the example above, the *geopackage* file
contains only a single layer that is being read. If it had contained
multiple layers, then the first layer would have been read and a
warning would have been emitted. The available layers of a data
set can be queried by

```
# Driver: GPKG
# Available layers:
# layer_name geometry_type features fields
# 1 nc.gpkg Multi Polygon 100 14
```

Simple feature objects can be written with `st_write`

, as in

`# [1] "/tmp/RtmpNqCwUi/file657b74dd35dcd.gpkg"`

```
# Writing layer `layer_nc' to data source
# `/tmp/RtmpNqCwUi/file657b74dd35dcd.gpkg' using driver `GPKG'
# Writing 100 features with 14 fields and geometry type Multi Polygon.
```

where the file format (GPKG) is derived from the file name extension.

### 7.1.3 Subsetting

A very common operation is to subset objects; base R can use `[`

for this. The rules that apply to `data.frame`

objects also apply to
`sf`

objects, e.g. that records 2-5 and columns 3-7 are selected by

but with a few additional features, in particular

- the
`drop`

argument is by default`FALSE`

meaning that the geometry column is*always*selected, and an`sf`

object is returned; when it is set to`TRUE`

and the geometry column*not*selected, it is dropped and a`data.frame`

is returned; - selection with a spatial (
`sf`

,`sfc`

or`sfg`

) object as first argument leads to selection of the features that spatially*intersect*with that object (see next section); other predicates than*intersects*can be chosen by setting parameter`op`

to a function such as`st_covers`

or or any other binary predicate function listed in section 3.2.2

### 7.1.4 Binary predicates

Binary predicates like `st_intersects`

, `st_covers`

etc (section
3.2.2) take two sets of features or feature geometries and return for
all pairs whether the predicate is `TRUE`

or `FALSE`

. For large
sets this would potentially result in a huge matrix, typically filled
mostly with `FALSE`

values and for that reason a sparse representation
is returned by default:

```
# Sparse geometry binary predicate list of length 5, where the predicate
# was `intersects'
# 1: 1, 2
# 2: 1, 2, 3
# 3: 2, 3
# 4: 4, 7
# 5: 5, 6
```

```
# Warning in st_centroid.sf(nc7): st_centroid assumes attributes are constant over
# geometries of x
```

Figure 7.2 shows how the intersections of the first five with the first seven counties can be understood. We can transform the sparse logical matrix into a dense matrix by

```
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
# [2,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE
# [3,] FALSE TRUE TRUE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
# [5,] FALSE FALSE FALSE FALSE TRUE TRUE FALSE
```

The number of countries that each of `nc5`

intersects with is

`# [1] 2 3 2 2 2`

and the other way around, the number of counties in `nc5`

that
intersect with each of the counties in `nc7`

is

`# [1] 2 3 2 1 1 1 1`

The object `i`

is of class `sgbp`

(sparse geometrical binary
predicate), and is a list of integer vectors, with each element
representing a row in the logical predicate matrix holding the column
indices of the `TRUE`

values for that row. It further holds some
metadata like the predicate used, and the total number of columns.
Methods available for `sgbp`

objects include

```
# [1] as.data.frame as.matrix dim Ops print
# [6] t
# see '?methods' for accessing help and source code
```

where the only `Ops`

method available is `!`

, the negation operation.

### 7.1.5 tidyverse

The tidyverse is a collection of data science packages that work
together, described e.g. in (Wickham and Grolemund 2017; Wickham et al. 2019). Package `sf`

has
tidyverse-style read and write functions, `read_sf`

and `write_sf`

,
which return a tibble rather than a `data.frame`

, do not print any
output, and overwrite existing data by default.

Further `tidyverse`

generics with methods for `sf`

objects include
`filter`

, `select`

, `group_by`

, `ungroup`

, `mutate`

, `transmute`

,
`rowwise`

, `rename`

, `slice`

and `summarise`

, `distinct`

, `gather`

,
`spread`

, `nest`

, `unnest`

, `unite`

, `separate`

, `separate_rows`

,
`sample_n`

, and `sample_frac`

. Most of these methods simply manage
the metadata of `sf`

objects, and make sure the geometry remains
present. In case a user wants the geometry to be removed, one can
use `st_set_geometry(nc, NULL)`

or simply coerce to a tibble before
selecting:

```
# # A tibble: 3 x 1
# BIR74
# <dbl>
# 1 1091
# 2 487
# 3 3188
```

The `summarise`

method for `sf`

objects has two special arguments:

`do_union`

(default`TRUE`

) determines whether grouped geometries are unioned on return, so that they form a valid geometry;`is_coverage`

(default`FALSE`

) in case the geometries grouped form a coverage (do not have overlaps), setting this to`TRUE`

speeds up the unioning

The `distinct`

method selects distinct records, where `st_equals`

is used to evaluate distinctness of geometries.

`filter`

can be used with the usual predicates; when one wants to
use it with a spatial predicate, e.g. to select all counties less
than 50 km away from Orange county, one could use

```
sf_use_s2(TRUE)
orange <- nc %>% filter(NAME == "Orange")
wd = st_is_within_distance(nc, orange, units::set_units(50, km))
o50 <- nc %>% filter(lengths(wd) > 0)
nrow(o50)
```

`# [1] 17`

Figure 7.3 shows the results of this analysis,
and in addition a buffer around the county borders; note that this buffer
serves for illustration, it was *not* used to select the counties.

## 7.2 Spatial joins

In regular (left, right or inner) joins, *joined* records from a
pair of tables are reported when one or more selected attributes
match (are identical) in both tables. A spatial join is similar,
but the criterion to join records is not equality of attributes but
a spatial predicate. This leaves a wide variety of options in order
to define *spatially* matching records, using binary predicates
listed in section 3.2.2. The concepts of “left”, “right”,
“inner” or “full” joins remain identical to the non-spatial join
as the options for handling records that have no spatial match.

When using spatial joins, each record may have several matched
records, yielding a large result table. A way to reduce this
complexity may be to select from the matching records the one with
the largest overlap with the target geometry. An example of this
is shown (visually) in figure 7.4; this is done using
`st_join`

with argument `largest = TRUE`

.

```
# example of largest = TRUE:
nc <- st_transform(read_sf(system.file("shape/nc.shp", package="sf")), 2264)
gr = st_sf(
label = apply(expand.grid(1:10, LETTERS[10:1])[,2:1], 1, paste0, collapse = " "),
geom = st_make_grid(nc))
gr$col = sf.colors(10, categorical = TRUE, alpha = .3)
# cut, to check, NA's work out:
gr = gr[-(1:30),]
suppressWarnings(nc_j <- st_join(nc, gr, largest = TRUE))
# the two datasets:
opar = par(mfrow = c(2,1), mar = rep(0,4))
plot(st_geometry(nc_j))
plot(st_geometry(gr), add = TRUE, col = gr$col)
text(st_coordinates(st_centroid(st_geometry(gr))), labels = gr$label)
# the joined dataset:
plot(st_geometry(nc_j), border = 'black', col = nc_j$col)
text(st_coordinates(st_centroid(st_geometry(nc_j))), labels = nc_j$label, cex = .8)
plot(st_geometry(gr), border = 'green', add = TRUE)
```

Another way to reduce the result set is to use `aggregate`

after a
join, all matching records, and union their geometries; see Section
5.4.

### 7.2.1 sampling, gridding, interpolating

Several convenience functions are available in package `sf`

, some
of which will be discussed here. Function `st_sample`

generates
a sample of points randomly sampled from target geometries, where
target geometries can be point, line or polygon geometries. Sampling
strategies can be (completely) random, regular or (with polygons)
triangular. Chapter 11 explains how spatial sampling
(or point pattern simulation) methods available in package `spatstat`

are interfaced through `st_sample`

.

Function `st_make_grid`

creates a square, rectangular or hexagonal
grid over a region, or points with the grid centers or corners. It
was used to create the rectangular grid in figure 7.4.

Function `st_interpolate_aw`

“interpolates” area values to new areas,
as explained in section 5.3, both for intensive
and extensive variables.

## 7.3 Package `stars`

Athough package `sp`

has always had limited support for raster data,
over the last decade R package `raster`

(Hijmans 2020) has clearly been
dominant as the prime package for powerful, flexible and scalable
raster analysis. The raster data model of package `raster`

(and
its successor, `terra`

(Hijmans 2021)) is that of a 2D regular raster,
or a set of raster layers (a “raster stack”). This aligns with
the classical static “GIS world view”, where the world is modelled
as a set of layers, each representing a different theme. A lot of
data available today however is dynamic, and comes as time series
of rasters or raster stacks. A raster stack does not meaningfully
reflect this, requiring the user to keep a register of which layer
represents what.

Also, the `raster`

package, and its successer `terra`

do an excellent
job in scaling computations up to datasizes no larger than the local
storage (the computer’s hard drives), and doing this fast. Recent
datasets however, including satellite imagery, climate model or
weather forecasting data, often no longer fit in local storage
(chapter 8). Package `spacetime`

(Pebesma 2012)
addresses the analysis of time series of vector geometries or raster
grid cells, but does not extend to higher-dimensional arrays.

Here, we introduce package `stars`

for analysing raster and vector
data cubes. The package

- allows for representing dynamic (time varying) raster stacks,
- aims at being scalable, also beyond local disk size,
- provides a strong integration of raster functions in the GDAL library
- in addition to regular grids handles rotated, sheared, rectilinear and curvilinear rasters (figure 1.5),
- provides a tight integration with package
`sf`

, - also handles array data with non-raster spatial dimensions, the
*vector data cubes*, and - follows the tidyverse design principles.

Vector data cubes include for instance time series for simple features, or spatial graph data such as potentially dynamic origin-destination matrices. The concept of spatial vector and raster data cubes was explained in chapter 6.

### 7.3.1 Reading and writing raster data

Raster data typically are read from a file. We use a dataset
containing a section of Landsat 7 scene, with the 6 30m-resolution
bands (bands 1-5 and 7) for a region covering the city of Olinda,
Brazil. We can read the example GeoTIFF file holding a regular,
non-rotated grid from the package `stars`

:

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 1 54 69 68.9 86 255
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

where we see the offset, cellsize, coordinate reference system, and dimensions. The dimension table reports the following for each dimension:

`from`

: the starting index`to`

: the ending index`offset`

: the dimension value at the start (edge) of the first pixel`delta`

: the cell size; negative`delta`

values indicate that pixel index increases with decreasing dimension values`refsys`

: the reference system`point`

: boolean, indicating whether cell values have point support or cell support`values`

: a list with values or labels associated with each of the dimension’s values`x/y`

: an indicator whether a dimension is associated with a spatial raster x- or y-axis

For regular, rotated or sheared grids or other regularly discretized
dimensions (e.g. time), `offset`

and `delta`

are not `NA`

and values
and `values`

is `NULL`

; for other cases, `offset`

and `delta`

are
`NA`

and `values`

contains one of:

- the sequence of values, or intervals, in case of a rectlinear spatial raster or irregular time dimension
- in case of a vector data cube, geometries associated with the spatial dimension
- in case of a curvilinear raster, the matrix with coordinate values for each raster cell
- in case of a discrete dimension, the band names or labels associated with the dimension values

The object `r`

is of class `stars`

and is a simple list of length
one, holding a three-dimensional array:

`# [1] 1`

`# [1] "array"`

```
# x y band
# 349 352 6
```

and in addition holds an attribute with a dimensions table with all the metadata required to know what the array dimensions refer to, obtained by

We can get the spatial extent of the array by

```
# xmin ymin xmax ymax
# 288776 9110729 298723 9120761
```

Raster data can be written to local disk using `write_stars`

:

where again the data format (in this case, GeoTIFF) is derived from the file extension. As for simple features, reading and writing uses the GDAL library; the list of available drivers for raster data is obtained by

### 7.3.2 Subsetting `stars`

data cubes

Data cubes can be subsetted using `[`

, or using tidyverse verbs. The
first options, `[`

uses for the arguments: attributes first,
followed by dimension. This means that `r[1:2, 101:200, , 5:10]`

selects from `r`

attributes 1-2, index 101-200 for dimension 1, and index
5-10 for dimension 3; omitting dimension 2 means that no subsetting
takes place. For attributes, attributes names or logical vectors
can be used. For dimensions, logical vectors are not supported;
Selecting discontinuous ranges supported when it it is a regular
sequence. By default, `drop`

is FALSE, when set to `TRUE`

dimensions
with a single value are dropped:

```
# x y band
# 100 50 1
```

```
# x y
# 100 50
```

For selecting particular ranges of dimension *values*, one can use
`filter`

(after loading `dplyr`

):

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 5 51 63 64.3 75 242
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 35 289004 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 1 1 NA NA NULL
```

which changes the offset of the \(x\) dimension. Particular cube slices
can also be obtained with `slice`

, e.g.

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 21 49 63 64.4 77 255
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
```

which drops the singular dimension. `mutate`

can be used on `stars`

objects to add new arrays as functions of existing ones, `transmute`

drops existing ones.

### 7.3.3 Cropping

Further subsetting can be done using spatial objects of class `sf`

, `sfc`

or `bbox`

, e.g. when using the sample raster,

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# L7_ETMs.tif 22 54 66 67.7 78.2 174 2184
# dimension(s):
# from to offset delta refsys point values x/y
# x 157 193 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 159 194 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

selects the circular center region with a diameter of 500 metre, for
the first band shown in figure 7.5,
where we see that pixels outside the spatial object are assigned `NA`

values.
This object still has dimension indexes relative to the `offset`

and `delta`

values of `r`

; we can reset these to a new `offset`

with

```
# from to offset delta refsys point values x/y
# x 1 37 293222 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 36 9116258 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

By default, the resulting raster is cropped to the extent of the selection object; an object with the same dimensions as the input object is obtained with

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# L7_ETMs.tif 22 54 66 67.7 78.2 174 731280
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

Cropping a `stars`

object can alternatively be done directly with `st_crop`

, as in

### 7.3.4 redimensioning `stars`

objects

Package `stars`

uses package `abind`

(Plate and Heiberger 2016) for a number of
array manipulations. One of them is `aperm`

which transposes an
array by permuting it. A method for `stars`

objects is provided, and

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 1 54 69 68.9 86 255
# dimension(s):
# from to offset delta refsys point values x/y
# band 1 6 NA NA NA NA NULL
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
```

permutes the order of dimensions of the resulting object.

Attributes and dimensions can be swapped, using `merge`

and `split`

:

```
# stars object with 2 dimensions and 6 attributes
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# X1 47 67 78 79.1 89 255
# X2 32 55 66 67.6 79 255
# X3 21 49 63 64.4 77 255
# X4 9 52 63 59.2 75 255
# X5 1 63 89 83.2 112 255
# X6 1 32 60 60.0 88 255
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
```

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# X 1 54 69 68.9 86 255
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# attributes 1 6 NA NA NA NA X1,...,X6
```

split distributes the `band`

dimension over 6 attributes of a
2-dimensional array, `merge`

reverses this operation. `st_redimension`

can be used for more generic operations, such as splitting a single array
dimension over two new dimensions:

```
# stars object with 4 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 1 54 69 68.9 86 255
# dimension(s):
# from to offset delta refsys point values
# X1 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL
# X2 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL
# X3 1 3 NA NA NA NA NULL
# X4 1 2 NA NA NA NA NULL
```

### 7.3.5 extracting point samples, aggregating

A very common use case for raster data cube analysis is the extraction
of values at certain locations, or computing aggregations over certain
geometries. `st_extract`

extracts point values. We will do this
for a few randomly sampled points over the bounding box of `r`

:

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 12 41.8 63 61 80.5 145
# dimension(s):
# from to offset delta refsys point
# geometry 1 20 NA NA UTM Zone 25, S... TRUE
# band 1 6 NA NA NA NA
# values
# geometry POINT (293002 9115516),...,POINT (290941 9114128)
# band NULL
```

which results in a vector data cube with 4 points and 6 bands.

### 7.3.6 Predictive models

The typical model prediction workflow in R works as follows:

- use a
`data.frame`

with response and predictor variables (covariates) - create a model object based on this
`data.frame`

- call
`predict`

with this model object and the`data.frame`

with target predictor variable values

Package `stars`

provides a `predict`

method for `stars`

objects
that essentially wraps the last step, by creating the `data.frame`

,
calling the `predict`

method for that, and reconstructing a `stars`

object with the predicted values.

We will illustrate this with a trivial two-class example mapping land from sea in the example Landsat data set, using the sample points extracted above, shown in figure 7.6.

```
plot(r[,,,1], reset = FALSE)
col = rep("green", 20)
col[c(8, 14, 15, 18, 19)] = "red"
st_as_sf(e) %>% st_coordinates() %>% text(labels = 1:20, col = col)
```

From this figure, we read “by eye” that the points 8, 14, 15, 18 and 19 are on water, the others on land. Using a linear discriminant (“maximum likelihood”) classifier, we find model predictions as shown in figure 7.7.

```
rs = split(r)
trn = st_extract(rs, pts)
trn$cls = rep("land", 20)
trn$cls[c(8, 14, 15, 18, 19)] = "water"
model = MASS::lda(cls ~ ., st_set_geometry(trn, NULL))
pr = predict(rs, model)
```

Here, we used the `MASS::`

prefix to avoid loading `MASS`

, as that
would mask `select`

from `dplyr`

. Another way would be to load
`MASS`

and unload it later on with `detach()`

.

We also see that the layer plotted in figure 7.7 is a
`factor`

variable, with class labels.

### 7.3.7 GDAL utils

### 7.3.8 Plotting raster data

We can use the base `plot`

method for `stars`

objects, where the plot created with
`plot(r)`

is shown in figure 7.8.

is shown in figure 7.8. The default color scale uses grey tones, and stretches these such that color breaks correspond to data quantiles over all bands (“histogram equalization”). A more familiar view is the RGB or false color composite shown in figure 7.9.

```
par(mfrow = c(1, 2))
plot(r, rgb = c(3,2,1), reset = FALSE, main = "RGB") # rgb
plot(r, rgb = c(4,3,2), main = "False color (NIR-R-G)") # false color
```

Further details and options are given in Chapter 9.

### 7.3.9 Analysing raster data

Element-wise mathematical functions (section 6.3.2)
on `stars`

objects are just passed on to the arrays. This means
that we can call functions and create expressions:

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 0 3.99 4.23 4.12 4.45 5.54
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 1 62 77.5 77.1 94.9 266
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

or even mask out certain values:

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# L7_ETMs.tif 50 64 75 79 90 255 149170
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

or un-mask areas:

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 0 54 69 63 86 255
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
# band 1 6 NA NA NA NA NULL
```

Dimension-wise, we can apply functions to selected array dimensions
(section 6.3.3)
of stars objects similar to how `apply`

does this to arrays. For
instance, we can compute for each pixel the mean of the 6 band values by

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# mean 25.5 53.3 68.3 68.9 82 255
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
```

A more meaningful function would e.g. compute the NDVI (normalized differenced vegetation index):

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# ndvi -0.753 -0.203 -0.0687 -0.0643 0.187 0.587
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
```

Alternatively, one could have defined

which is more convenient if the number of bands is large,
but which is also much slower than `ndvi`

as it
needs to be called for every pixel whereas `ndvi`

can be called
once for all pixels, or for large chunks of pixels.
The mean for each band over the whole image is computed by

```
# band mean
# 1 1 79.1
# 2 2 67.6
# 3 3 64.4
# 4 4 59.2
# 5 5 83.2
# 6 6 60.0
```

the result of which is small enough to be printed here as a `data.frame`

. In these
two examples, entire dimensions disappear. Sometimes, this does not
happen (section 6.3.2); we can for instance compute the three quartiles for each band

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 32 60.8 66.5 69.8 78.8 112
# dimension(s):
# from to offset delta refsys point values
# quantile 1 3 NA NA NA NA 25%, 50%, 75%
# band 1 6 NA NA NA NA NULL
```

and see that this *creates* a new dimension, `quantile`

, with three values.
Alternatively, the three quantiles over the 6 bands for each pixel are
obtained by

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 4 55 69.2 67.2 81.2 255
# dimension(s):
# from to offset delta refsys point values x/y
# quantile 1 3 NA NA NA NA 25%, 50%, 75%
# x 1 349 288776 28.5 UTM Zone 25, S... FALSE NULL [x]
# y 1 352 9120761 -28.5 UTM Zone 25, S... FALSE NULL [y]
```

### 7.3.10 Curvilinear rasters

There are several reasons why non-regular rasters occur (figure 1.5). For one, when the data is Earth-bound, a regular raster does not fit the Earth’s surface, which is curved. Other reasons are:

- when we convert or transform a regular raster data into another coordinate reference system, it will become curvilinear unless we resample (warp; section 7.7); warping always goes at the cost of some loss of data and is not reversible.
- observation may lead to irregular rasters; e.g. for satellite swaths, we
may have a regular raster in the direction of the satellite (not
aligned with \(x\) or \(y\)), and rectilinear in the direction perpendicular to that
(e.g. if the sensor discretizes the viewing
*angle*in equal sections)

## 7.4 Vector data cube examples

### 7.4.1 Example: aggregating air quality time series

Air quality data from package `spacetime`

were
obtained from the airBase European air quality data
base.
Daily average PM\(_{10}\) values were downloaded for rural background
stations in Germany, 1998-2009.

We can create a `stars`

object from the `air`

matrix, the `dates`

Date vector and the `stations`

`SpatialPoints`

objects by

```
# space time
# 70 4383
```

```
d = st_dimensions(station = st_as_sfc(stations), time = dates)
(aq = st_as_stars(list(PM10 = air), dimensions = d))
```

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# PM10 0 9.92 14.8 17.7 22 274 157659
# dimension(s):
# from to offset delta refsys point
# station 1 70 NA NA +proj=longlat ... TRUE
# time 1 4383 1998-01-01 1 days Date FALSE
# values
# station POINT (9.59 53.7),...,POINT (9.45 49.2)
# time NULL
```

We can see from figure 7.10 that the time series are quite long, but also have large missing value gaps. Figure 7.11 shows the spatial distribution measurement stations and mean PM\(_{10}\) values.

```
plot(st_as_sf(st_apply(aq, 1, mean, na.rm = TRUE)), reset = FALSE, pch = 16,
ylim = st_bbox(DE)[c(2,4)])
```

`# Warning in sp::proj4string(obj): CRS object has comment, which is lost in output`

We can aggregate these station time series to area means, mostly
as a simple exercise. For this, we use the `aggregate`

method for
`stars`

objects

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# PM10 1.08 10.9 15.3 17.9 21.8 172 25679
# dimension(s):
# from to offset delta refsys point
# geometry 1 16 NA NA +proj=longlat ... FALSE
# time 1 4383 1998-01-01 1 days Date FALSE
# values
# geometry MULTIPOLYGON (((9.65 49.8, ...,...,MULTIPOLYGON (((10.8 51.6, ...
# time NULL
```

and we can now for instance show the maps for six arbitrarily chosen days (figure 7.12), using

or create a time series plot of mean values for a single state (figure 7.13) by

### 7.4.2 Example: Bristol origin-destination datacube

The data used for this example come from (Lovelace, Nowosad, and Muenchow 2019), and concern origin-destination (OD) counts: the number of persons going from region A to region B, by transportation mode. We have feature geometries for the 102 origin and destination regions, shown in figure 7.14.

```
library(spDataLarge)
plot(st_geometry(bristol_zones), axes = TRUE, graticule = TRUE)
plot(st_geometry(bristol_zones)[33], col = 'red', add = TRUE)
```

and the OD counts come in a table with OD pairs as records, and transportation mode as variables:

```
# # A tibble: 6 x 7
# o d all bicycle foot car_driver train
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 E02002985 E02002985 209 5 127 59 0
# 2 E02002985 E02002987 121 7 35 62 0
# 3 E02002985 E02003036 32 2 1 10 1
# 4 E02002985 E02003043 141 1 2 56 17
# 5 E02002985 E02003049 56 2 4 36 0
# 6 E02002985 E02003054 42 4 0 21 0
```

We see that many combinations of origin and destination are implicit zeroes, otherwise these two numbers would have been similar:

`# [1] 10404`

`# [1] 2910`

We will form a three-dimensional vector datacube with origin,
destination and transportation mode as dimensions. For this, we
first “tidy” the `bristol_od`

table to have origin (o), destination (d),
transportation mode (mode), and count (n) as variables, using `gather`

:

```
# create O-D-mode array:
bristol_tidy <- bristol_od %>%
select(-all) %>%
gather("mode", "n", -o, -d)
head(bristol_tidy)
```

```
# # A tibble: 6 x 4
# o d mode n
# <chr> <chr> <chr> <dbl>
# 1 E02002985 E02002985 bicycle 5
# 2 E02002985 E02002987 bicycle 7
# 3 E02002985 E02003036 bicycle 2
# 4 E02002985 E02003043 bicycle 1
# 5 E02002985 E02003049 bicycle 2
# 6 E02002985 E02003054 bicycle 4
```

Next, we form the three-dimensional array `a`

, filled with zeroes:

```
od = bristol_tidy %>% pull("o") %>% unique()
nod = length(od)
mode = bristol_tidy %>% pull("mode") %>% unique()
nmode = length(mode)
a = array(0L, c(nod, nod, nmode),
dimnames = list(o = od, d = od, mode = mode))
dim(a)
```

`# [1] 102 102 4`

We see that the dimensions are named with the zone names (o, d)
and the transportation mode name (mode).
Every row of `bristol_tidy`

denotes an array entry, and we can
use this to to fill the non-zero entries of the `bristol_tidy`

table
with their appropriate value (`n`

):

To be sure that there is not an order mismatch between the zones
in `bristol_zones`

and the zone names in `bristol_tidy`

, we can
get the right set of zones by:

```
order = match(od, bristol_zones$geo_code) # it happens this equals 1:102
zones = st_geometry(bristol_zones)[order]
```

(It happens that the order is already correct, but it is good practice to not assume this).

Next, with zones and modes we can create a stars dimensions object:

```
# from to offset delta refsys point
# o 1 102 NA NA WGS 84 FALSE
# d 1 102 NA NA WGS 84 FALSE
# mode 1 4 NA NA NA FALSE
# values
# o MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
# d MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
# mode bicycle,...,train
```

and finally build or stars object from `a`

and `d`

:

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# N 0 0 0 4.8 0 1296
# dimension(s):
# from to offset delta refsys point
# o 1 102 NA NA WGS 84 FALSE
# d 1 102 NA NA WGS 84 FALSE
# mode 1 4 NA NA NA FALSE
# values
# o MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
# d MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
# mode bicycle,...,train
```

We can take a single slice through from this three-dimensional
array, e.g. for zone 33 (figure 7.14), by `odm[,,33]`

,
and plot it with

`# Warning in st_as_sf.stars(x): working on the first sfc dimension only`

```
# Warning in st_bbox.dimensions(st_dimensions(obj), ...): returning the bounding
# box of the first geometry dimension
# Warning in st_bbox.dimensions(st_dimensions(obj), ...): returning the bounding
# box of the first geometry dimension
```

the result of which is shown in figure 7.15. Subsetting this way, we take all attributes (there is only one: N) since the first argument is empty, we take all origin regions (second argument empty), we take destination zone 33 (third argument), and all transportation modes (fourth argument empty, or missing).

We plotted this particular zone because it has the largest number of travelers as its destination. We can find this out by summing all origins and travel modes by destination:

`# [1] 33`

Other aggregations we can carry out include: total transportation by OD (102 x 102):

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# sum 0 0 0 19.2 19 1434
# dimension(s):
# from to offset delta refsys point
# o 1 102 NA NA WGS 84 FALSE
# d 1 102 NA NA WGS 84 FALSE
# values
# o MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
# d MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
```

Origin totals, by mode:

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# sum 1 57.5 214 490 771 2903
# dimension(s):
# from to offset delta refsys point
# o 1 102 NA NA WGS 84 FALSE
# mode 1 4 NA NA NA FALSE
# values
# o MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
# mode bicycle,...,train
```

Destination totals, by mode:

```
# stars object with 2 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# sum 0 13 104 490 408 12948
# dimension(s):
# from to offset delta refsys point
# d 1 102 NA NA WGS 84 FALSE
# mode 1 4 NA NA NA FALSE
# values
# d MULTIPOLYGON (((-2.51 51.4,...,...,MULTIPOLYGON (((-2.55 51.5,...
# mode bicycle,...,train
```

Origin totals, summed over modes:

Destination totals, summed over modes (we had this):

We plot `o`

and `d`

together after joining them by

the result of which is shown in figure 7.16.

There is something to say for the argument that such maps give the wrong message, as both amount (color) and polygon size give an impression of amount. To take out the amount in the count, we can compute densities (count / km\(^2\)), by

`# udunits database from /usr/share/xml/udunits/udunits2.xml`

```
a = set_units(st_area(st_as_sf(o)), km^2)
o$sum_km = o$sum / a
d$sum_km = d$sum / a
od = c(o["sum_km"], d["sum_km"], along = list(od = c("origin", "destination")))
plot(od, logz = TRUE)
```

shown in figure 7.17. Another way to normalize these totals would be to divide them by population size.

### 7.4.3 Tidy array data

The *tidy data* paper (Wickham 2014b) may suggest that such array data
should be processed not as an array, but in a long table where
each row holds (region, class, year, value), and it is always good
to be able to do this. For primary handling and storage however,
this is often not an option, because

- a lot of array data are collected or generated as array data, e.g. by imagery or other sensory devices, or e.g. by climate models
- it is easier to derive the long table form from the array than vice versa
- the long table form requires much more memory, since the space occupied by dimension values is \(O(nmp)\), rather than \(O(n+m+p)\)
- when missing-valued cells are dropped, the long table form loses the implicit indexing of the array form

To put this argument to the extreme, consider for instance that
all image, video and sound data are stored in array form; few
people would make a real case for storing them in a long table
form instead. Nevertheless, R packages like `tsibble`

take this
approach, and have to deal with ambiguous ordering of multiple records
with identical time steps for different spatial features and index
them, which is solved for both *automatically* by using the array
form – at the cost of using dense arrays, in package `stars`

.

Package `stars`

tries to follow the tidy
manifesto
to handle array sets, and has particularly developed support for the
case where one or more of the dimensions refer to space, and/or time.

## 7.5 raster-to-vector, vector-to-raster

Section 1.3 already showed some examples of raster-to-vector and vector-to-raster conversions, this section will add some code details and examples.

### 7.5.1 vector-to-raster

`st_as_stars`

is meant as a method to transform objects into `stars`

objects. However, not all stars objects are `raster`

objects,
and the method for `sf`

objects creates a vector data cube with
the geometry as its spatial (vector) dimension, and attributes
as attributes. When given a feature *geometry* (`sfc`

) object,
`st_as_stars`

will rasterize it, as shown in section 7.7,
and in figure 7.18.

`# [1] "/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg"`

Here, `st_as_stars`

can be parameterized to control cell size,
number of cells, and/or extent. The cell values returned are 0 for
cells with center point outside the geometry and 1 for cell with
center point inside or on the border of the geometry. Rasterizing
existing features is done using `st_rasterize`

, as also shown in
figure 1.4:

```
library(dplyr)
read_sf(file) %>%
mutate(name = as.factor(NAME)) %>%
select(SID74, SID79, name) %>%
st_rasterize()
```

```
# stars object with 2 dimensions and 3 attributes
# attribute(s):
# SID74 SID79 name
# Min. : 0 Min. : 0 Sampson : 655
# 1st Qu.: 3 1st Qu.: 3 Columbus: 648
# Median : 5 Median : 6 Robeson : 648
# Mean : 8 Mean :10 Bladen : 604
# 3rd Qu.:10 3rd Qu.:13 Wake : 590
# Max. :44 Max. :57 (Other) :30952
# NA's :30904 NA's :30904 NA's :30904
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 461 -84.3239 0.0192484 NAD27 FALSE NULL [x]
# y 1 141 36.5896 -0.0192484 NAD27 FALSE NULL [y]
```

Similarly, line and point geometries can be rasterized, as shown in figure 7.19.

## 7.6 Coordinate transformations and conversions

### 7.6.1 `st_crs`

Spatial objects of class `sf`

or `stars`

contain a coordinate
reference system that can be get or replaced with `st_crs`

, or be
set or replaced in a pipe with `st_set_crs`

. Coordinate reference
systems can be set with an EPSG code, like `st_crs(4326)`

which
will be converted to `st_crs('EPSG:4326')`

, or with a PROJ.4 string
like `""+proj=utm +zone=25 +south"`

, a name like “WGS84”, or a name
preceded by an authority like “OGC:CRS84”; alternatives include
a coordinate reference system definition in WKT, WKT2 (section
2.5) or PROJJSON.

The object returned contains two fields, `wkt`

with the WKT2
representation, and `input`

with the user input, if any, or else
a copy of the wkt field. Note that PROJ.4 strings can be used to
*define* some coordinate reference systems, but they cannot be used
to *represent* coordinate reference systems. Conversion of a
WKT2 in a `crs`

object to a proj4string using the `$proj4string`

method, as in

`# [1] "+proj=longlat +datum=WGS84 +no_defs"`

may succeed but is not in general lossless or invertible.

### 7.6.2 `st_transform`

, `sf_project`

Coordinate transformations or conversions (section 1.7.2)
for `sf`

or `stars`

objects are carried out with `st_transform`

,
which takes as its first argument a spatial object of class `sf`

or
`stars`

that has a coordinate reference system set, and as a second
argument with an `crs`

object (or something that can be converted
to it with `st_crs`

). When PROJ finds more than one possibility
to transform or convert from the source `crs`

to the target `crs`

,
it chooses the one with the highest declared accuracy. More fine-grained
control over the options is explained in section 7.6.5.

A lower-level function to transform or convert coordinates *not*
in `sf`

or `stars`

objects is `sf_project`

: it takes a matrix with
coordinates and a source and target `crs`

, and returns transformed
or converted coordinates.

### 7.6.3 `sf_proj_info`

Function `sf_proj_info`

can be used to query available projections,
ellipsoids, units and prime meridians available in the PROJ
software. It takes a single parameter, `type`

, which can have the
following values:

`type = "proj"`

lists the short and long names of available projections; short names can be used in a “+proj=name” string.`type = "ellps"`

lists available ellipses, with name, long name, and ellipsoidal parameters.`type = "units"`

lists the available length units, with conversion constant to meters`type = "prime_meridians"`

lists the prime meridians with their position with respect to the Greenwich meridian.

### 7.6.4 proj.db, datum grids, cdn.proj.org, local cache

Datum grids (section 1.7.2) can be installed locally, or be read from the PROJ datum grid CDN at https://cdn.proj.org/. If installed locally, they are read from the PROJ search path, which is shown by

`# [1] "/home/edzer/.local/share/proj" "/usr/share/proj"`

The main PROJ database is `proj.db`

, an sqlite3 database typically
found at

`# [1] "/usr/share/proj/proj.db"`

which can be read. The version of the snapshot of the EPSG database
included in each PROJ release is stated in the `"metadata"`

table of
`proj.db`

; the version of the PROJ runtime used by **sf** is shown by

```
# PROJ
# "7.2.1"
```

If for a particular coordinate transformation datum grids are not locally found, PROJ will search for online datum grids in the PROJ CDN when

`# [1] FALSE`

returns `TRUE`

. By default it is set to `FALSE`

, but

`# [1] "https://cdn.proj.org"`

sets it to `TRUE`

and returns the URL of the network resource used;
this resource can also be set to another resource, that may be
faster or less limited.

After querying a datum grid on the CDN, PROJ writes the *portion* of
the grid queried (not, by default, the entire grid) to a local cache,
which is another sqlite3 database found locally in a user directory,
e.g. at

`# [1] "/home/edzer/.local/share/proj/cache.db"`

that will be searched first in subsequent datum grid queries.

### 7.6.5 transformation pipelines

Internally, PROJ uses a so-called *coordinate operation pipeline*,
to represent the sequence of operations to get from a source CRS to
a target CRS. Given multiple options to go from source to target,
`st_transform`

chooses the one with highest accuracy. We can query
the options available by

```
# Candidate coordinate operations found: 5
# Strict containment: FALSE
# Axis order auth compl: FALSE
# Source: EPSG:4326
# Target: EPSG:22525
# Best instantiable operation has accuracy: 2 m
# Description: axis order change (2D) + Inverse of Corrego Alegre 1970-72 to
# WGS 84 (2) + UTM zone 25S
# Definition: +proj=pipeline +step +proj=unitconvert +xy_in=deg +xy_out=rad
# +step +inv +proj=hgridshift
# +grids=br_ibge_CA7072_003.tif +step +proj=utm
# +zone=25 +south +ellps=intl
```

and see that pipeline with the highest accuracy is summarised; we can see that it specifies use of a datum grid. Had we not switched on the network search, we would have obtained a different result:

`# character(0)`

```
# Candidate coordinate operations found: 5
# Strict containment: FALSE
# Axis order auth compl: FALSE
# Source: EPSG:4326
# Target: EPSG:22525
# Best instantiable operation has accuracy: 5 m
# Description: axis order change (2D) + Inverse of Corrego Alegre 1970-72 to
# WGS 84 (4) + UTM zone 25S
# Definition: +proj=pipeline +step +proj=unitconvert +xy_in=deg +xy_out=rad
# +step +proj=push +v_3 +step +proj=cart
# +ellps=WGS84 +step +proj=helmert +x=206.05
# +y=-168.28 +z=3.82 +step +inv +proj=cart
# +ellps=intl +step +proj=pop +v_3 +step +proj=utm
# +zone=25 +south +ellps=intl
# Operation 4 is lacking 1 grid with accuracy 2 m
# Missing grid: br_ibge_CA7072_003.tif
# URL: https://cdn.proj.org/br_ibge_CA7072_003.tif
```

and a report that a datum grid is missing
The object returned by `sf_proj_pipelines`

is a sub-classed `data.frame`

, with columns

```
# [1] "id" "description" "definition" "has_inverse" "accuracy"
# [6] "axis_order" "grid_count" "instantiable" "containment"
```

and we can list for instance the accuracies by

`# [1] 5 5 8 2 NA`

Here, `NA`

refers to “ballpark accuracy”, which may be anything in the
30-100 m range:

```
# Candidate coordinate operations found: 1
# Strict containment: FALSE
# Axis order auth compl: FALSE
# Source: EPSG:4326
# Target: EPSG:22525
# Best instantiable operation has only ballpark accuracy
# Description: axis order change (2D) + Ballpark geographic offset from WGS 84
# to Corrego Alegre 1970-72 + UTM zone 25S
# Definition: +proj=pipeline +step +proj=unitconvert +xy_in=deg +xy_out=rad
# +step +proj=utm +zone=25 +south +ellps=intl
```

The default, most accurate pipeline chosen by `st_transform`

can
be overriden by specifying `pipeline`

argument, as selected from
the set of options in `p$definition`

.

## 7.7 transforming and warping rasters

When using `st_transform`

on a raster data set, as e.g. in

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# L7_ETMs.tif 1 54 69 68.9 86 255
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 NA NA WGS 84 FALSE [349x352] -34.9165,...,-34.8261 [x]
# y 1 352 NA NA WGS 84 FALSE [349x352] -8.0408,...,-7.94995 [y]
# band 1 6 NA NA NA NA NULL
# curvilinear grid
```

we see that a *curvilinear* is created, which means that for every
grid cell the coordinates are computed in the new CRS, which no
longer form a *regular* grid. Plotting such data is extremely slow,
as small polygons are computed for every grid cell and then plotted.
The advantage of this is that no information is lost: grid cell
values remain identical after the projection.

When we start with a raster on a regular grid and want to obtain
a *regular* grid in a new coordinate reference system, we need
to *warp* the grid: we need to recreate a grid at new locations,
and use some rule to assign values to new grid cells. Rules can
involve using the nearest value, or using some form of interpolation.
This operation is not lossless and not invertible.

The best options for warping is to specify the target grid as a
`stars`

object. When only a target CRS is specified, default options
for the target grid are picked that may be completely inappropriate
for the problem at hand. An example workflow that uses only a
target CRS is

```
# from to offset delta refsys point values x/y
# x 1 350 -34.9166 0.000259243 WGS 84 NA NULL [x]
# y 1 352 -7.94982 -0.000259243 WGS 84 NA NULL [y]
# band 1 6 NA NA NA NA NULL
```

which creates a pretty close raster, but then the transformation is also relatively modest. For a workflow that creates a target raster first, here with exactly the same number of rows and columns as the original raster one could use:

```
r = read_stars(tif)
grd = st_bbox(r) %>%
st_as_sfc() %>%
st_transform(4326) %>%
st_bbox() %>%
st_as_stars(nx = dim(r)["x"], ny = dim(r)["y"])
st_warp(r, grd)
```

```
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# L7_ETMs.tif 1 54 69 68.9 86 255 6180
# dimension(s):
# from to offset delta refsys point values x/y
# x 1 349 -34.9166 0.000259666 WGS 84 NA NULL [x]
# y 1 352 -7.94982 -0.000258821 WGS 84 NA NULL [y]
# band 1 6 NA NA NA NA NULL
```

## 7.8 Exercises

- NDVI, normalized differenced vegetation index, is coputed as (NIR-R)/(NIR+R), with NIR the near infrared and R the red band. Read the
`L7_ETMs.tif`

file into object`x`

, and distribute the band dimensions over attributes by`split(x, "band")`

. Then, compute NDVI by using an expression that uses the NIR (band 4) and R (band 3) attributes directly. - Compute NDVI for the S2 image, using
`st_apply`

and an a function`ndvi = function(x) (x[4]-x[3])/(x[4]+x[3])`

. Plot the result, and write the result to a GeoTIFF. Explain the difference in runtime between plotting and writing. - Use
`st_transform`

to transform the`stars`

object read from`L7_ETMs.tif`

to EPSG 4326. Print the object. Is this a regular grid? Plot the first band using arguments`axes=TRUE`

and`border=NA`

, and explain why this takes such a long time. - Use
`st_warp`

to warp the`L7_ETMs.tif`

object to EPSG 4326, and plot the resulting object with`axes=TRUE`

. Why is the plot created much faster than after`st_transform`

?

### References

Eddelbuettel, Dirk. 2013. *Seamless R and C++ Integration with Rcpp*. Springer.

Herring, John, and others. 2011. “Opengis Implementation Standard for Geographic Information-Simple Feature Access-Part 1: Common Architecture [Corrigendum].”

Hijmans, Robert J. 2020. *Raster: Geographic Data Analysis and Modeling*. https://rspatial.org/raster.

Hijmans, Robert J. 2021. *Terra: Spatial Data Analysis*. https://rspatial.org/terra.

Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. *Geocomputation with R*. Chapman; Hall/CRC. https://geocompr.robinlovelace.net/.

Pebesma, Edzer. 2012. “spacetime: Spatio-Temporal Data in R.” *Journal of Statistical Software* 51 (7): 1–30. https://www.jstatsoft.org/v51/i07/.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” *The R Journal* 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

Plate, Tony, and Richard Heiberger. 2016. *Abind: Combine Multidimensional Arrays*. https://CRAN.R-project.org/package=abind.

Wickham, Hadley. 2014b. “Tidy Data.” *Journal of Statistical Software* 59 (1). https://www.jstatsoft.org/article/view/v059i10.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” *Journal of Open Source Software* 4 (43): 1686.

Wickham, Hadley, and Garret Grolemund. 2017. *R for Data Science*. O’Reilly. http://r4ds.had.co.nz/.