In this post, we will learn about univariate linear dynamic spatio-temporal models (DSTMs) in a discrete-time context.

In previous posts, we have learned the basics of time series analysis. A DSTM can be viewed as a time series of spatial processes. DSTMs provide powerful insights into causative mechanisms, as they allow us not only to make predictions of spatial processes but also to make inferences on the parameters of models of the underlying mechanisms.

The data model and the process model are given by (1) and (2), respectively:

$$Z_t(\cdot) = \mathcal{H}_t(Y_t(\cdot), \boldsymbol{\theta}_{d,t}, \epsilon_t(\cdot)), \quad t = 1, \dots, T \tag{1}$$

$$Y_t(\cdot) = \mathcal{M}(Y_{t-1}(\cdot), \boldsymbol{\theta}_{p,t}, \eta_t(\cdot)), \quad t = 1, 2, \dots \tag{2}$$

There are two key assumptions that are usually made for DSTMs.

  1. $Z_t(\cdot)$ are independent in time when conditioned on the true process $Y_t(\cdot)$ and the parameters $\boldsymbol{\theta}_{d,t}$. The joint distribution of the data conditioned on the process and parameters can then be represented in product form.

  2. The Markov assumption: conditioned on the past, only the recent past is important to explain the present. Under the first-order Markov assumption, the process at time $t$ conditioned on all of the past depends only on the most recent past:

$$[Y_0(\cdot), Y_1(\cdot), \dots, Y_T(\cdot) \,|\, \boldsymbol{\theta}_{p,t}] = \left( \prod_{t=1}^{T} [Y_t(\cdot) \,|\, Y_{t-1}(\cdot), \boldsymbol{\theta}_{p,t}] \right) [Y_0(\cdot) \,|\, \boldsymbol{\theta}_{p,0}]$$

The process model (2) can be linear or nonlinear, and the associated conditional distribution, $[Y_t(\cdot) \,|\, Y_{t-1}(\cdot)]$, can be Gaussian or non-Gaussian. As in autoregressive modeling in time series, one can also make higher-order Markov assumptions.

$\boldsymbol{\theta}_{d,t}$ and $\boldsymbol{\theta}_{p,t}$ are the parameters of the data model and the process model, respectively. The primary challenge in DSTMs is to effectively reduce the parameter space.

Latent Linear Gaussian DSTMs

The authors illustrate DSTMs using the simplest but most widely used situation, in which the process operator $\mathcal{M}$ in (2) is assumed to be linear with an additive Gaussian error distribution.

We are interested in a latent process $\{Y_t(\mathbf{s}_i)\}$ at a set of locations $\{\mathbf{s}_i : i = 1, \dots, n\}$, and we have data/observations at locations $\{\mathbf{r}_{jt} : j = 1, \dots, m_t;\ t = 0, 1, \dots, T\}$. Note that there could be a different number of data locations at different observation times, but it is assumed that there is a finite set of $m$ possible data locations to be considered, so $m_t \leq m$. If there were infinitely many possible data locations, then the location itself would become a random variable, giving a Poisson point process, which is not covered in this book. The locations can be either points or areal units.

Linear Data Model with Additive Gaussian Error

Consider the $m_t$-dimensional data vector, $\mathbf{Z}_t \equiv (Z_t(\mathbf{r}_{1t}), \dots, Z_t(\mathbf{r}_{m_t t}))'$, and the $n$-dimensional latent-process vector, $\mathbf{Y}_t \equiv (Y_t(\mathbf{s}_1), \dots, Y_t(\mathbf{s}_n))'$. We aim to infer the latent-process vector.

For the $j$th observation at time $t$, the linear data model with additive Gaussian error can be written as:

$$Z_t(\mathbf{r}_{jt}) = b_t(\mathbf{r}_{jt}) + \sum_{i=1}^{n} h_{t,ji} Y_t(\mathbf{s}_i) + \epsilon_t(\mathbf{r}_{jt})$$

The vector-matrix form is:

$$\mathbf{Z}_t = \mathbf{b}_t + \mathbf{H}_t \mathbf{Y}_t + \boldsymbol{\epsilon}_t, \quad \boldsymbol{\epsilon}_t \sim \mathrm{Gau}(\mathbf{0}, \mathbf{C}_{\epsilon,t})$$

where $\mathbf{b}_t$ is an $m_t$-dimensional offset term and $\mathbf{H}_t$ is the $m_t \times n$ mapping matrix, which is typically assumed to be known. $\mathbf{C}_{\epsilon,t}$ is an $m_t \times m_t$ error covariance matrix. It can in general include dependence in space or time, although we typically assume that the errors are independent in time. In practice, given that most of the dependence structure in the observations is contained in the process, the structure of $\mathbf{C}_{\epsilon,t}$ should be very simple. It is often assumed that these data-model errors are independent with constant variance in time and space, so that $\mathbf{C}_{\epsilon,t} = \sigma^2_\epsilon \mathbf{I}_{m_t}$.
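To make this concrete, here is a minimal simulation sketch (my own illustration, not from the book) of the linear Gaussian data model, with an incidence-style mapping matrix that observes a random subset of the process locations and errors with constant variance:

```python
import numpy as np

# Minimal sketch (illustrative values): simulate Z_t = b_t + H_t Y_t + eps_t
# with C_eps,t = sigma_eps^2 I.
rng = np.random.default_rng(0)
n, m_t, sigma_eps = 100, 40, 0.25

Y_t = rng.standard_normal(n)              # latent process at n locations
obs_idx = rng.choice(n, size=m_t, replace=False)
H_t = np.zeros((m_t, n))
H_t[np.arange(m_t), obs_idx] = 1.0        # each row picks out one location
b_t = np.zeros(m_t)                       # zero offset for simplicity

Z_t = b_t + H_t @ Y_t + sigma_eps * rng.standard_normal(m_t)
```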

Non-Gaussian and Nonlinear Data Model

Non-Gaussian data models use a transformation of the mean response, $g(Y_t(\mathbf{s})) \equiv \hat{Y}_t(\mathbf{s})$. We can also include a mapping matrix in the non-Gaussian data model:

$$\mathbf{Z}_t \,|\, \mathbf{Y}_t, \gamma \sim \mathrm{EF}(\mathbf{H}_t \mathbf{Y}_t, \gamma)$$

To consider a nonlinear transformation of the latent process (even if the error term is Gaussian), the simplest way is to apply a power transformation to each element of $\mathbf{Y}_t$:

$$\mathbf{Z}_t = \mathbf{b}_t + \mathbf{H}_t \mathbf{Y}_t^a + \boldsymbol{\epsilon}_t, \quad \boldsymbol{\epsilon}_t \sim \mathrm{Gau}(\mathbf{0}, \mathbf{C}_{\epsilon,t})$$

In some applications it is reasonable to assume that the transformation power $a$ varies with space or time and may depend on covariates.
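As a brief sketch of the exponential-family case (my own illustration; the Poisson choice and all values are assumptions), a count data model with a log link has this form:

```python
import numpy as np

# Minimal sketch (illustrative): a Poisson data model from the exponential
# family, Z_t | Y_t ~ Poisson(exp(H_t Y_t)), i.e., a log link.
rng = np.random.default_rng(1)
n, m_t = 100, 40
Y_t = rng.standard_normal(n)
H_t = np.eye(m_t, n)                  # observe the first m_t locations
Z_t = rng.poisson(np.exp(H_t @ Y_t))  # conditionally independent counts
```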

Process Model

Linear Markovian spatio-temporal process models generally assume that the value of the process at a given location at the present time is made up of a weighted combination (or is a “smoothed version”) of the process throughout the spatial domain at previous times, plus an additive, Gaussian, spatially coherent “innovation”.

A first-order spatio-temporal integro-difference equation (IDE) process model is:

$$Y_t(\mathbf{s}) = \int_{D_s} m(\mathbf{s}, \mathbf{x}; \boldsymbol{\theta}_p) Y_{t-1}(\mathbf{x}) \, d\mathbf{x} + \eta_t(\mathbf{s}), \quad \mathbf{s}, \mathbf{x} \in D_s \tag{7}$$

where $m(\mathbf{s}, \mathbf{x}; \boldsymbol{\theta}_p)$ is a transition kernel. Notably, the asymmetry and the rate of decay of the transition kernel $m$ control propagation (linear advection) and spread (diffusion), respectively. Spatially coherent disturbances tend to spread across space at a greater rate when the kernel is wider, which leads to more averaging from one time to the next. An offset (shifted) kernel pulls information from one particular direction and redistributes it in the opposite direction, leading to propagation.

It is often assumed that $\boldsymbol{\theta}_p$ does not vary with time. In (7), it is assumed that the process $Y_t(\cdot)$ has mean zero. However, it sometimes may be more appropriate to model a non-zero mean, which will be discussed later on. In general, $\int_{D_s} m(\mathbf{s}, \mathbf{x}; \boldsymbol{\theta}_p) \, d\mathbf{x} < 1$ is required for the process to be stable (non-explosive) in time.

When we consider a finite set of spatial prediction locations or regions (e.g., an irregular lattice or a regular grid), the first-order IDE (7) can be discretized as a stochastic difference equation,

$$Y_t(\mathbf{s}_i) = \sum_{j=1}^{n} m_{ij}(\boldsymbol{\theta}_p) Y_{t-1}(\mathbf{s}_j) + \eta_t(\mathbf{s}_i) \tag{8}$$

Now, defining the process vector $\mathbf{Y}_t \equiv (Y_t(\mathbf{s}_1), \dots, Y_t(\mathbf{s}_n))'$, (8) can be written in vector-matrix form as:

$$\mathbf{Y}_t = \mathbf{M} \mathbf{Y}_{t-1} + \boldsymbol{\eta}_t \tag{9}$$

The stability condition requires that the maximum modulus of the eigenvalues of $\mathbf{M}$ (which may be complex-valued) be less than 1. If one fits an unconstrained linear DSTM to data that come from a nonlinear process (many real-world spatio-temporal processes are nonlinear), then the fitted model may be unstable, that is, explosive with exponential growth. This indicates either that the wrong model is being fitted or that the finite time window of the observations captures a transient period of growth. Long-lead-time forecasts from such a model are problematic, but for short-term forecasts it can actually be helpful, as the mean of the predictive distribution averages over realizations that are both explosive and non-explosive.

Transient growth occurs when the transition operator $\mathbf{M}$ is stable (the maximum modulus of the eigenvalues of $\mathbf{M}$ is less than 1) but $\mathbf{M}$ is "non-normal" (in the discrete-space case, $\mathbf{M}'\mathbf{M} \neq \mathbf{M}\mathbf{M}'$, in which case the eigenvectors of $\mathbf{M}$ are non-orthogonal).
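A small numerical sketch (my own illustration, with hand-picked values) shows how a stable but non-normal $\mathbf{M}$ produces transient growth:

```python
import numpy as np

# Minimal sketch (illustrative): a stable but non-normal transition matrix
# yields transient growth of the process norm before eventual decay.
M = np.array([[0.5, 10.0],
              [0.0,  0.5]])

print(np.max(np.abs(np.linalg.eigvals(M))))  # 0.5 < 1: the dynamics are stable
print(np.allclose(M.T @ M, M @ M.T))         # False: M is non-normal

Y = np.array([0.0, 1.0])
for t in range(1, 11):
    Y = M @ Y
    print(t, np.linalg.norm(Y))  # the norm jumps well above 1, then decays
```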

Process and Parameter Dimension Reduction

The latent linear Gaussian DSTM has unknown parameters associated with the data model ($\mathbf{C}_\epsilon$), the transition operator $m(\mathbf{s}, \mathbf{x}; \boldsymbol{\theta}_p)$, and the initial-condition distribution (e.g., $\boldsymbol{\mu}_0$ and $\mathbf{C}_0$). With the linear Gaussian data model, one typically considers a simple parameterization of $\mathbf{C}_\epsilon$ such as $\mathbf{C}_\epsilon = \sigma^2_\epsilon \mathbf{I}$. Alternatively, we can use the covariance matrix implied by a simple spatial random process that has just a few parameters, such as a Matérn spatial covariance function or a spatial conditional autoregressive process.

One of the biggest challenges for DSTMs in hierarchical-modeling settings is the curse of dimensionality associated with the process model. A common situation is that the number of locations $n$ is much larger than the number of time points $T$. The DSTM process model is then problematic, as there are on the order of $n^2$ parameters to estimate. Two approaches are discussed in the book.

Parameter Dimension Reduction

The process-error spatial variance-covariance matrix $\mathbf{C}_\eta$ can be represented by a common spatial covariance function or by basis-function random effects, so there is no need to estimate a full positive-definite matrix in the DSTM.

The authors highlight that the transition matrix could have as many as $n^2$ parameters and thereby needs extra care. For a simple linear DSTM (9), we could parameterize the transition matrix as:

  1. a random walk: $\mathbf{M} = \mathbf{I}$
  2. a spatially homogeneous autoregressive process: $\mathbf{M} = \theta_p \mathbf{I}$
  3. a spatially varying autoregressive process: $\mathbf{M} = \mathrm{diag}(\boldsymbol{\theta}_p)$

The third parameterization is more realistic and useful for real-world processes than the first two. Based on this parameterization, with $\mathbf{C}_\eta = \sigma^2_\eta \mathbf{I}$ and $\mathbf{M} = \mathrm{diag}(\boldsymbol{\theta}_p)$, we can decompose the first-order conditional distribution as:

$$[\mathbf{Y}_t \,|\, \mathbf{Y}_{t-1}, \boldsymbol{\theta}_p, \sigma^2_\eta] = \prod_{i=1}^{n} [Y_t(\mathbf{s}_i) \,|\, Y_{t-1}(\mathbf{s}_i), \theta_p(i), \sigma^2_\eta], \quad t = 1, 2, \dots \tag{10}$$

Conditional on the parameters $\boldsymbol{\theta}_p = (\theta_p(1), \dots, \theta_p(n))'$, we have spatially independent univariate AR(1) processes at each spatial location. However, it is worth noting that if $\boldsymbol{\theta}_p$ is random and has spatial dependence, then the marginal conditional distribution $[\mathbf{Y}_t \,|\, \mathbf{Y}_{t-1}, \sigma^2_\eta]$, after we integrate $\boldsymbol{\theta}_p$ out, can imply that all of the elements of $\mathbf{Y}_{t-1}$ influence the transition to time $t$ at all spatial locations (i.e., a non-separable spatio-temporal process). For forecasting applications, we often seek parameterizations that directly include interactions in the conditional model.
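A quick sketch of the third parameterization (values are illustrative): with $\mathbf{M} = \mathrm{diag}(\boldsymbol{\theta}_p)$, each location simply evolves as its own univariate AR(1) process.

```python
import numpy as np

# Minimal sketch (illustrative): M = diag(theta_p) with C_eta = sigma_eta^2 I
# reduces the joint transition to n independent univariate AR(1) updates.
rng = np.random.default_rng(5)
n, T, sigma_eta = 25, 100, 1.0
theta_p = rng.uniform(0.2, 0.9, n)   # one AR(1) coefficient per location

Y = np.zeros((T, n))
for t in range(1, T):
    Y[t] = theta_p * Y[t - 1] + sigma_eta * rng.standard_normal(n)
```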

The authors again point out that the transition kernel is important. We can model realistic linear behavior by parameterizing the kernel shape (decay and asymmetry) in terms of the kernel width (variance) and shift (mean). And if we let these parameters change with space, then we can model quite complex dynamics with a relatively small number of parameters. For example, the IDE process model given by (7) can be specified with a Gaussian-shape transition kernel as a function of $x$ relative to the location $s$ in a 1-D spatial domain (for simplicity):

$$m(s, x; \boldsymbol{\theta}_p) = \theta_{p,1} \exp\left( -\frac{1}{\theta_{p,2}} (x - \theta_{p,3} - s)^2 \right) \tag{11}$$

On the right-hand side of (11), the three parameters in $\boldsymbol{\theta}_p$ are, respectively, the kernel amplitude, the variance (kernel scale), and the mean (shift) relative to location $s$. The kernel in (11) is positive but need not integrate to 1 over $x$. Recall that wider kernels lead to faster spread, and a positive shift leads to leftward movement. To obtain more complex dynamical behavior, we can allow these parameters to change with space. For example, let the mean (shift) parameter satisfy $\theta_{p,3}(s) = \mathbf{x}(s)'\boldsymbol{\beta} + w(s)$, where $\mathbf{x}(s)$ corresponds to covariates at spatial location $s$ and $\boldsymbol{\beta}$ are the regression parameters. The authors also mention that we can vary $\theta_{p,2}$, but $\theta_{p,3}$ is often more important.
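Here is a minimal discretization sketch of the kernel in (11) on an evenly spaced 1-D grid (the domain, grid size, and parameter values are all my own assumptions), which turns the IDE into the stochastic difference equation (8)-(9):

```python
import numpy as np

# Minimal sketch (illustrative): discretize the Gaussian-shaped IDE kernel
# (11) on a 1-D grid to build the transition matrix M in (9).
n = 100
s = np.linspace(0.0, 1.0, n)
ds = s[1] - s[0]                  # grid spacing, used as the integral weight

amp, scale, shift = 10.0, 0.002, 0.02   # theta_{p,1}, theta_{p,2}, theta_{p,3}

# M[i, j] ~ m(s_i, x_j) dx; the kernel integral amp * sqrt(pi * scale) ~ 0.79
# is < 1, so the discretized process is stable.
M = amp * np.exp(-((s[None, :] - shift - s[:, None]) ** 2) / scale) * ds

rng = np.random.default_rng(6)
Y = np.exp(-((s - 0.5) ** 2) / 0.005)   # initial bump in the middle
for t in range(20):
    Y = M @ Y + 0.01 * rng.standard_normal(n)   # (9) with white-noise eta_t
# The positive shift pulls information from the right of each location,
# so the bump propagates leftward while diffusing.
```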

Although the IDE kernel is feasible for continuous space, there are many occasions on which we need efficient parameterizations (1) in a discrete-space setting or (2) in the context of random effects in basis-function expansions. The authors then discuss, in discrete space, transition operators that only consider local spatial neighborhoods.

Lagged-Nearest-Neighbor (LNN) Representation

For discrete space, a very parsimonious yet realistic dynamical model can be specified as:

$$Y_t(\mathbf{s}_i) = \sum_{\mathbf{s}_j \in \mathcal{N}_i} m_{ij} Y_{t-1}(\mathbf{s}_j) + \eta_t(\mathbf{s}_i) \tag{12}$$

where $\mathcal{N}_i$ corresponds to a pre-specified neighborhood of the location $\mathbf{s}_i$. For those who are familiar with spatial regression and spatial weights, it is easy to see how neighborhoods of locations can weight $\mathbf{Y}_{t-1}$ in the transition to $\mathbf{Y}_t$. The parameters can be further parameterized to account for decay (spread and diffusion) rate and asymmetry (propagation direction), as in the sketch below. Of course, we can also let the parameters vary in space.
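The following sketch (my own illustration; the weight values are hand-picked assumptions) builds an LNN transition matrix on a 2-D grid, where each location is influenced only by itself and its four nearest neighbors, with asymmetric east-west weights to induce propagation:

```python
import numpy as np

# Minimal sketch (illustrative): an LNN transition matrix on an nx-by-ny grid.
# Interior row sums are 0.8 < 1, which keeps the dynamics stable.
nx = ny = 10
n = nx * ny
w_self, w_ns, w_ew, asym = 0.3, 0.1, 0.15, 0.05

M = np.zeros((n, n))
for i in range(nx):
    for j in range(ny):
        k = i * ny + j                        # vectorized grid index
        M[k, k] = w_self
        if i > 0:      M[k, k - ny] = w_ns    # north neighbor
        if i < nx - 1: M[k, k + ny] = w_ns    # south neighbor
        if j > 0:      M[k, k - 1] = w_ew + asym   # west neighbor (heavier)
        if j < ny - 1: M[k, k + 1] = w_ew - asym   # east neighbor (lighter)
# Pulling more from the west propagates features eastward over time.
```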

Motivation of an LNN with a Mechanistic Model

The LNN parameterization can be motivated by mechanistic models such as integro-differential or partial differential equations (PDEs). In the PDE case, the parameters $m_{ij}$ in (12) can be parameterized in terms of mechanistic knowledge, such as spatially varying diffusion or advection coefficients. Consider the basic linear, non-random, advection-diffusion PDE,

$$\frac{\partial Y}{\partial t} = a \frac{\partial^2 Y}{\partial x^2} + b \frac{\partial^2 Y}{\partial y^2} + u \frac{\partial Y}{\partial x} + v \frac{\partial Y}{\partial y}$$

where $a$ and $b$ are diffusion coefficients that control the rate of spread, and $u$ and $v$ are advection parameters that control the flow. A simple finite-difference discretization of such a PDE on a 2-D equally spaced finite grid can lead to LNN specifications of the form:

$$\mathbf{Y}_t = \mathbf{M}(\boldsymbol{\theta}_p) \mathbf{Y}_{t-1} + \mathbf{M}_b(\boldsymbol{\theta}_p) \mathbf{Y}_{b,t} + \boldsymbol{\eta}_t$$

where $\mathbf{Y}_t$ corresponds to the non-boundary grid points and $\mathbf{Y}_{b,t}$ is the boundary process. $\boldsymbol{\eta}_t$ is assumed to be Gaussian, mean-zero, and independent in time. These parameters can vary with space as well, and we can model them either in terms of covariates or as a spatial random process.
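To see where the neighbor weights come from, here is a sketch (my own illustration, in 1-D for brevity; the grid sizes and coefficients are assumptions) of a standard explicit finite-difference discretization of $\partial Y / \partial t = a\, \partial^2 Y / \partial x^2 + u\, \partial Y / \partial x$, which yields a tridiagonal (lagged-nearest-neighbor) transition matrix:

```python
import numpy as np

# Minimal sketch (illustrative): explicit finite differences turn a 1-D
# advection-diffusion PDE into Y_t = M Y_{t-1} + boundary terms, where M
# only connects nearest neighbors.
n, dx, dt = 100, 1.0, 0.1
a, u = 1.0, 0.5                   # diffusion and advection coefficients

d = a * dt / dx**2                # diffusion number (d <= 1/2 for stability)
c = u * dt / (2 * dx)             # advection number (centered difference)

M = np.zeros((n, n))
for i in range(1, n - 1):
    M[i, i] = 1 - 2 * d           # self weight
    M[i, i - 1] = d - c           # left neighbor
    M[i, i + 1] = d + c           # right neighbor
# Rows 0 and n-1 are left empty here; boundary values would enter through
# the separate boundary term M_b Y_{b,t} above.
```

The kernel decay corresponds to the diffusion number $d$ and the asymmetry to the advection number $c$, so estimating a few PDE coefficients replaces estimating $n^2$ free transition weights.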

Dimension Reduction in the Process Model

In a previous post, we learned to use spatio-temporal random-effects models to reduce the process dimensionality. This is particularly helpful for the DSTM process model.

Consider an extension to the spatial basis-function mixed-effects model,

$$Y_t(\mathbf{s}) = \mathbf{x}_t(\mathbf{s})'\boldsymbol{\beta} + \sum_{i=1}^{n_\alpha} \phi_i(\mathbf{s}) \alpha_{i,t} + \sum_{j=1}^{n_\xi} \psi_j(\mathbf{s}) \xi_{j,t} + \nu_t(\mathbf{s}) \tag{15}$$

where $\alpha_{i,t}$ are the dynamically evolving random coefficients, and $\xi_{j,t}$ are non-dynamical. Both sets of basis functions, $\phi_i(\cdot)$ and $\psi_j(\cdot)$, are assumed known. $\nu_t(\cdot)$ is assumed to be a Gaussian process with mean zero, independent in time.

The vector form of (15) is:

$$\mathbf{Y}_t = \mathbf{X}_t \boldsymbol{\beta} + \boldsymbol{\Phi} \boldsymbol{\alpha}_t + \boldsymbol{\Psi} \boldsymbol{\xi}_t + \boldsymbol{\nu}_t$$

where $\mathbf{X}_t$ is an $n \times (p+1)$ covariate matrix that could be time-varying; $\mathbf{X}_t \boldsymbol{\beta}$ can be interpreted as a spatial offset corresponding to large-scale non-dynamical features or covariate effects. $\boldsymbol{\Phi}$ is an $n \times n_\alpha$ matrix of basis vectors, and $\boldsymbol{\Psi}$ is an $n \times n_\xi$ matrix of basis vectors. $\boldsymbol{\alpha}_t$ and $\boldsymbol{\xi}_t$ are, respectively, the latent dynamical and non-dynamical coefficients.

The dynamical coefficient process $\boldsymbol{\alpha}_t$ can evolve according to a linear equation. For example, we can specify a first-order vector autoregressive model (VAR(1)),

$$\boldsymbol{\alpha}_t = \mathbf{M}_\alpha \boldsymbol{\alpha}_{t-1} + \boldsymbol{\eta}_t$$

Again, the transition operator should be allowed to be non-normal (i.e., $\mathbf{M}_\alpha' \mathbf{M}_\alpha \neq \mathbf{M}_\alpha \mathbf{M}_\alpha'$), as almost all real-world linear processes correspond to non-normal transition operators.
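A brief end-to-end sketch of this dimension reduction (my own illustration; the random orthonormal basis is just a stand-in for, e.g., EOFs, and all values are assumptions):

```python
import numpy as np

# Minimal sketch (illustrative): the dynamics evolve on an n_alpha-dimensional
# coefficient vector rather than on all n locations, so the transition model
# has n_alpha^2 parameters instead of n^2.
rng = np.random.default_rng(2)
n, n_alpha, T = 500, 10, 100

Phi, _ = np.linalg.qr(rng.standard_normal((n, n_alpha)))  # orthonormal basis
M_alpha = 0.7 * np.eye(n_alpha) + 0.05 * rng.standard_normal((n_alpha, n_alpha))
print(np.max(np.abs(np.linalg.eigvals(M_alpha))))         # check stability < 1

alpha = np.zeros(n_alpha)
for t in range(T):
    alpha = M_alpha @ alpha + 0.1 * rng.standard_normal(n_alpha)  # VAR(1)
Y_t = Phi @ alpha            # map the coefficients back to the n locations
```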

Importantly, the notion of "neighbors" is not always well defined in these formulations. If the basis functions given in $\boldsymbol{\Phi}$ are global basis functions, such as some types of splines, Fourier bases, EOFs, etc., then the elements of $\boldsymbol{\alpha}_t$ are not spatially indexed.

The authors then discuss the choice of basis functions. For DSTMs, it is usually important to specify basis functions such that interactions across spatial scales can accommodate transient growth. This is difficult to achieve in "knot-based" representations such as B-splines, kernel convolutions, and predictive processes, where the $\boldsymbol{\alpha}_t$ are spatially referenced but not multi-resolutional. The dynamical evolution in the DSTM can accommodate scale interactions by using global basis functions such as EOFs, Fourier bases, etc.

Nonlinear DSTMs

Many mechanistic processes are nonlinear at the spatial and temporal scales of variability of interest. We can write the nonlinear spatio-temporal AR(1) process as:

$$Y_t(\cdot) = \mathcal{M}(Y_{t-1}(\cdot), \eta_t(\cdot); \boldsymbol{\theta}_p), \quad t = 1, 2, \dots$$

Here, $\mathcal{M}$ is a nonlinear function. There are infinitely many possible nonlinear statistical models. We can either (1) take a nonparametric view of the problem and learn the dynamics from the data, or (2) propose specific model classes that can accommodate the relevant dynamical behaviors.

State-Dependent Models

State-dependent models allow the transition matrix $\mathbf{M}$ to depend on the process (state) value $\mathbf{Y}_{t-1}$ at each time, with parameters $\boldsymbol{\theta}_p$ that may vary with time and space:

$$\mathbf{Y}_t = \mathbf{M}(\mathbf{Y}_{t-1}; \boldsymbol{\theta}_p) \mathbf{Y}_{t-1} + \boldsymbol{\eta}_t$$

One type of state-dependent model is the threshold vector autoregressive model,

$$\mathbf{Y}_t = \begin{cases} \mathbf{M}_1 \mathbf{Y}_{t-1} + \boldsymbol{\eta}_{1,t}, & \text{if } f(\mathbf{w}_t) \in d_1 \\ \quad \vdots & \\ \mathbf{M}_K \mathbf{Y}_{t-1} + \boldsymbol{\eta}_{K,t}, & \text{if } f(\mathbf{w}_t) \in d_K \end{cases}$$

where $f(\mathbf{w}_t)$ is a function of a time-varying variable $\mathbf{w}_t$ that can itself be a function of the process $\mathbf{Y}_{t-1}$.

The transition matrices $\{\mathbf{M}_1, \dots, \mathbf{M}_K\}$ and error covariance matrices $\{\mathbf{C}_{\eta_1}, \dots, \mathbf{C}_{\eta_K}\}$ depend on unknown parameters. A big challenge is to reduce the dimensionality of the parameter space.
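A toy two-regime sketch of such a threshold VAR (my own illustration; the threshold function and all values are assumptions):

```python
import numpy as np

# Minimal sketch (illustrative): a two-regime threshold VAR that switches
# transition matrices when the spatial mean of the lagged process crosses a
# threshold (here w_t is the lagged process itself and f is its mean).
rng = np.random.default_rng(3)
n, T, threshold = 20, 200, 0.5
M1 = 0.9 * np.eye(n)              # regime 1: highly persistent
M2 = 0.3 * np.eye(n)              # regime 2: strong mean reversion

Y = rng.standard_normal(n)
for t in range(T):
    M = M1 if np.mean(Y) < threshold else M2
    Y = M @ Y + 0.1 * rng.standard_normal(n)
```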

General Quadratic Nonlinearity

A large number of real-world processes in the physical and biological sciences exhibit quadratic interactions. Consider the following one-dimensional reaction-diffusion PDE:

$$\frac{\partial Y}{\partial t} = \underbrace{\frac{\partial}{\partial x}\left( \delta \frac{\partial Y}{\partial x} \right)}_{\text{diffusion term, linear in } Y} + Y \exp\left( \gamma_0 \left( 1 - \frac{Y}{\gamma_1} \right) \right)$$

where the first term corresponds to a diffusion (spread) term that depends on $\delta$, and the second term corresponds to a density-dependent growth term with growth parameter $\gamma_0$ and carrying-capacity parameter $\gamma_1$. Each of these parameters can vary with space and time. The first term is linear in $Y$, but the second term is nonlinear.

In discrete space and time, a general quadratic nonlinear (GQN) DSTM can be written as:

$$Y_t(\mathbf{s}_i) = \sum_{j=1}^{n} m_{ij} Y_{t-1}(\mathbf{s}_j) + \sum_{k=1}^{n} \sum_{l=1}^{n} b_{i,kl}\, g(Y_{t-1}(\mathbf{s}_l); \boldsymbol{\theta}_g)\, Y_{t-1}(\mathbf{s}_k) + \eta_t(\mathbf{s}_i)$$

Here, $b_{i,kl}$ are the quadratic-interaction transition coefficients. The function $g(\cdot)$ transforms one of the components of the quadratic interaction. This transformation is important for many processes, such as epidemic or invasive-species population processes. The conditional distribution of $Y_t(\cdot)$ given $Y_{t-1}(\cdot)$ in the GQN model is Gaussian, but the marginal model for $Y_t(\cdot)$ is in general not Gaussian, due to the nonlinear interactions. The GQN model is a special case of the state-dependent model.
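One GQN update step can be written compactly with a tensor contraction (my own sketch; the logistic choice of $g$ and all values are illustrative assumptions):

```python
import numpy as np

# Minimal sketch (illustrative): one step of a GQN DSTM on a small domain.
rng = np.random.default_rng(4)
n = 10
M = 0.5 * np.eye(n)                        # linear transition part
B = 0.01 * rng.standard_normal((n, n, n))  # b_{i,kl} interaction coefficients

def g(y):
    return 1.0 / (1.0 + np.exp(-y))        # transform one quadratic factor

Y = rng.standard_normal(n)
quad = np.einsum('ikl,l,k->i', B, g(Y), Y) # sum_k sum_l b_{i,kl} g(Y_l) Y_k
Y_next = M @ Y + quad + 0.1 * rng.standard_normal(n)
```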

Some Other Nonlinear Models

The authors introduced other promising approaches for non-linear spatio-temporal modeling.

  1. Variants of neural networks (e.g., RNNs). The original formulations do not address uncertainty quantification. However, there is increasing interest in considering deep learning models within broader uncertainty-based paradigms. We will have a new series of posts on deep-learning-related spatio-temporal modeling.
    • Due to the high dimensionality of the hidden states and parameters, these models typically require complex regularization or prior information to make them work.
    • Echo state networks (ESNs) consider sparsely connected hidden layers that allow for sequential interactions, yet most of the parameters are randomly generated and then fixed, with the only estimated parameters being those that connect the hidden layer to the response. However, ESNs do not include quadratic interactions or formal uncertainty quantification.
  2. Agent-based models, in which the process is built from local individual-scale interactions via simple rules that lead to complex nonlinear behavior. Although these models are parsimonious, they are computationally expensive, and parameter inference can be challenging.

  3. The "mechanism-free" approach of so-called "analogs". It seeks to find historical sequences (analogs) that match the sequence culminating at the current time.

One of the fundamental differences between DSTMs and multivariate time series models is that DSTMs require scalable parameterizations of the evolution model.