# Chapter 13 Multivariate and Spatiotemporal Geostatistics

Building on the simple interpolation methods presented in chapter 12, this chapter works out a case study for spatiotemporal interpolation, using NO$$_2$$ air quality data, and populations density as covariate.

## 13.1 Preparing the air quality dataset

The dataset we work with is an air quality dataset obtained from the European Environmental Agency (EEA). European member states report air quality measurements to this Agency. So-called validated data are quality controlled by member states, and are reported on a yearly basis. They form the basis for policy compliancy evaluations.

The EEA’s air quality e-reporting website gives access to the data reported by European member states. We decided to download hourly (time series) data, which is the data primarily measured. A web form helps convert simple selection criteria into an http GET request. The following URL

was created to select all validated (Source=E1a) $$NO_2$$ (Pollutant=8) data for 2017 (Year_from, Year_to) from Germany (CountryCode=DE). It returns a text file with a set of URLs to CSV files, each containing the hourly values for the whole period for a single measurement station. These files were downloaded and converted to the right encoding using the dos2unix command line utility.

In the following, we will read all the files into a list,

then convert the time variable into a POSIXct variable, and put them in time order by

We remove smaller subdatasets, which for this dataset have no hourly data:

and then combine all files using xts::cbind, so that they are matched based on matching times:

A usual further selection for this dataset is to select stations for which 75% of the hourly values measured are valid, i.e. drop those with more than 25% hourly values missing. Knowing that mean(is.na(x)) gives the fraction of missing values in a vector x, we can apply this function to the columns (stations):

Next, the station metadata was read and filtered for rural background stations in Germany ("DE") by

These stations contain coordinates, and an sf object with (static) station metadata is created by

We now subset the air quality measurements to include only stations that are of type rural background:

We can compute station means, and join these to stations locations by

Station mean NO$$_2$$ concentrations, along with country borders, are shown in in figure 12.1.

## 13.2 Multivariable geostatistics

Multivariable geostatics involves the joint modelling, prediction and simulation of multiple variables, $Z_1(s) = X_1 \beta_1 + e_1(s)$ $...$ $Z_n(s) = X_n \beta_n + e_n(s).$ In addition to having observations, trend models, and variograms for each variable, the cross variogram for each pair of residual variables, describing the covariance of $$e_i(s), e_j(s+h)$$, is required. If this cross covariance is non-zero, knowledge of $$e_j(s+h)$$ may help predict (or simulate) $$e_i(s)$$. This is especially true if $$Z_j(s)$$ is more densely sample than $$Z_i(s)$$. Prediction and simulation under this model are called cokriging and cosimulation. Examples using gstat are found when running the demo scripts

and are further illustrated and discussed in Bivand, Pebesma, and Gomez-Rubio (2013).

In case the different variables considered are observed at the same set of locations, for instance different air quality parameters, then the statistical gain of using cokriging as opposed to direct (univariable) kriging is often modest, when not negligible. A gain may however be that the prediction is truly multivariable: in addition to the prediction vector $$\hat{Z(s_0)}=(\hat{Z}_1(s_0),...,\hat{Z}_n(s_0))$$ we get the full covariance matrix of the prediction error (Ver Hoef and Cressie 1993). This means for instance that if we are interested in some linear combination of $$\hat{Z}(s_0)$$, such as $$\hat{Z}_2(s_0) - \hat{Z}_1(s_0)$$, that we can get the standard error of that combination because we have the correlations between the prediction errors.

Although sets of direct and cross variograms can be computed and fitted automatically, multivariable geostatistical modelling becomes quickly hard to manage when the number of variables gets large, because the number of direct and cross variograms required is $$n(n+1)/2$$.

In case different variables refer to the same variable take at different time steps, one could use a multivariable (cokriging) prediction approach, but this would not allow for e.g. interpolation between two time steps. For this, and for handling the case of having data observed at many time instances, one can also model its variation as a function of continuous space and time, as of $$Z(s,t)$$, which we will do in the next section.

## 13.3 Spatiotemporal geostatistics

Spatiotemporal geostatistical processes are modelled as variables having a value everywhere in space and time, $$Z(s,t)$$, with $$s$$ and $$t$$ the continuously indexed space and time index. Given observations $$Z(s_i,t_j)$$ and a variogram (covariance) model $$\gamma(s,t)$$ we can predict $$Z(s_0,t_0)$$ at arbitrary space/time locations $$(s_0,t_0)$$ using standard Gaussian process theory.

Several books have been written recently about modern approaches to handling and modelling spatiotemporal geostatistical data, including Wikle, Zammit-Mangion, and Cressie (2019) and Blangiardo and Cameletti (2015). Here, we will use Gräler, Pebesma, and Heuvelink (2016) and give some simple examples using the dataset also used for the previous chapter.

### 13.3.1 A spatiotemporal variogram model

Starting with the spatiotemporal matrix of NO$$_2$$ data in aq constructed at the beginning of this chapter, we will first select the measurements taken at rural background stations:

Then we will select the spatial locations for these stations by

and finally build a stars object with time and station as dimensions:

From this, we can compute the spatiotemporal variogram using

# Warning: Strategy 'multiprocess' is deprecated in future (>= 1.20.0). Instead,
# explicitly specify either 'multisession' or 'multicore'. In the current R
# session, 'multiprocess' equals 'multicore'.

which is shown in figure 13.1.

To this sample variogram, we can fit a variogram model. One relatively flexible model we try here is the product-sum model (Gräler, Pebesma, and Heuvelink 2016), fitted by

# space component:
#   model psill range
# 1   Nug  26.3     0
# 2   Exp 140.5   432
# time component:
#   model psill range
# 1   Nug  1.21   0.0
# 2   Sph 15.99  40.1
# k: 0.0322469094848839

and shown in figure 13.2.

which can also be plotted as wireframes, shown in figure 13.3.

Hints about the fitting strategy and alternative models for spatiotemporal variograms are given in Gräler, Pebesma, and Heuvelink (2016).

With this fitted model, and given the observations, we can carry out kriging or simulation at arbitrary points in space and time. For instance, we could estimate (or simulate) values in the time series that are now missing: this occurs regularly, and in section 12.4 we used means over time series based on simply ignoring up to 25% of the observations: substituting these with estimated or simulated values based on neigbouring (in space and time) observations before computing yearly mean values seems a more reasonable approach.

More in general, we can estimate at arbitrary locations and time points, and we will illustrate this with predicting time series at particular locations, and and predicting spatial slices (Gräler, Pebesma, and Heuvelink 2016). We can create a stars object for two randomly picked spatial points and all time instances by

and we obtain the spatiotemporal predictions at these two points using krigeST by

where the results are shown in figure 13.4.

Alternatively, we can create spatiotemporal predictions for a set of time-stamped raster maps, evenly spaced over the year 2017, created by

and the subsequent predictions are obtained by

# Coordinate system already present. Adding new coordinate system, which will replace the existing one.

and shown in figure 13.5.

A larger value for nmax was needed here to decrease the visible disturbance (sharp edges) caused by discrete neighbourhood selections, which are now done in space and time.

## 13.4 Exercises

1. Which fraction of the stations is removed in section 13.1 when the criterion applied that a station must be 75% complete?
2. From the hourly time series in no2.st, compute daily mean concentrations using aggregate, and compute the spatiotemporal variogram of this. How does it compare to the variogram of hourly values?
3. Carry out a spatiotemporal interpolation for daily mean values for the days corresponding to those shown in 13.5, and compare the results.
4. Following the example in the demo scripts pointed at in section 13.2, carry out a cokriging on the daily mean station data for the four days shown in 13.5. What are the differences of this approach to spatiotemporal kriging?

### References

Bivand, Roger S., Edzer Pebesma, and Virgilio Gomez-Rubio. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY. http://www.asdar-book.org/.

Blangiardo, Marta, and Michela Cameletti. 2015. Spatial and Spatio-Temporal Bayesian Models with R-Inla. John Wiley & Sons.

Gräler, Benedikt, Edzer Pebesma, and Gerard Heuvelink. 2016. “Spatio-Temporal Interpolation using gstat.” The R Journal 8 (1): 204–18. https://doi.org/10.32614/RJ-2016-014.

Ver Hoef, Jay M, and Noel Cressie. 1993. “Multivariable Spatial Prediction.” Mathematical Geology 25 (2): 219–40.

Wikle, Christopher K, Andrew Zammit-Mangion, and Noel Cressie. 2019. Spatio-Temporal Statistics with R. CRC Press.