Lecture-notes-complete

ALL Lecture notes for masters students taking Spatial Statistics
Module

Spatial Statistics (STATS5012)

Academic year: 2020/2021

Spatial Statistics 2020 - 4H and 5M

Contact details and sources of help

Lecturer - Professor Duncan Lee

Email - Duncan.Lee@glasgow.ac.uk

Office location - Room 342, Mathematics and Statistics building

Course webpage - https://moodle.gla.ac.uk/course/view.php?id=4739

Online help forum - https://padlet.com/duncan_lee1/Spatialstatistics

Course structure

Lectures - 20 in total at a rate of 2 per week.

Tutorials - 4 in total during the term.

Computer labs - 2 in total during the term.

Assessment

100% exam for both 4H and 5M variants in the April/May exam diet.

4H - 90 minute exam

5M - 120 minute exam

Past exam papers are on the Moodle course page.

Sources of help

If you have any problems with the material in the course then there are a number of options for getting help.

  1. Ask questions at the tutorial sessions, that is why we have them!

  2. Consult the course webpage on Moodle or leave a post on the online forum.

  3. Look in a book in the library.

  4. Come and see me outside lectures. It is probably best to send an email to arrange a time, as I teach other courses and am not always free.

Further reading

Although the course notes are fully self-contained, there are a large number of books on spatial statistics, some of which can be found in the library. Here are just a few.

  • Diggle, P and Ribeiro, P. Model-based Geostatistics, Springer (2007).

  • Bivand, R and Pebesma, E and Gomez-Rubio, V. Applied Spatial Data Analysis with R, Springer (2013).

  • Banerjee, S and Carlin, B and Gelfand, A. Hierarchical Modelling and Analysis for Spatial Data, CRC (2011).

Course aims

To introduce the three main types of spatial data, namely geostatistics, areal unit data and point processes, and describe how:

  • to identify trends and spatial autocorrelation.

Background knowledge

The following are things you should know from your previous courses, and we will use them repeatedly in this course. Let X be a continuous random variable. Then its mean, variance and standard deviation are given by

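Mean

\[ \mathrm{E}[X] = \int_{-\infty}^{\infty} x f(x) \, dx. \]

Variance

\[ \mathrm{Var}[X] = \mathrm{E}[(X - \mathrm{E}[X])^2] = \mathrm{E}[X^2] - \mathrm{E}[X]^2. \]

Standard deviation

\[ \mathrm{Sd}[X] = \sqrt{\mathrm{Var}[X]}. \]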

Now let X and Y be two continuous random variables. Then their covariance and correlation are defined by

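Covariance

\[ \mathrm{Cov}[X,Y] = \mathrm{E}[(X - \mathrm{E}[X])(Y - \mathrm{E}[Y])] = \mathrm{E}[XY] - \mathrm{E}[X]\mathrm{E}[Y]. \]

Correlation

\[ \mathrm{Corr}[X,Y] = \frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}, \]

where the correlation is restricted to lie between -1 and 1. The following properties hold for random variables $X$, $Y$, $V$, $W$ and constants $a$, $b$.

\begin{align*}
\mathrm{E}[aX + bY] &= a\,\mathrm{E}[X] + b\,\mathrm{E}[Y] \\
\mathrm{Var}[aX + bY] &= a^2\,\mathrm{Var}[X] + b^2\,\mathrm{Var}[Y] + 2ab\,\mathrm{Cov}[X,Y] \\
\mathrm{Cov}[X,X] &= \mathrm{Var}[X] \\
\mathrm{Cov}[a + X, Y] &= \mathrm{Cov}[X,Y] \\
\mathrm{Cov}[X,Y] &= \mathrm{Cov}[Y,X] \\
\mathrm{Cov}[aX, Y] &= a\,\mathrm{Cov}[X,Y] \\
\mathrm{Cov}[a, X] &= 0 \\
\mathrm{Cov}[X + Y, V + W] &= \mathrm{Cov}[X,V] + \mathrm{Cov}[X,W] + \mathrm{Cov}[Y,V] + \mathrm{Cov}[Y,W].
\end{align*}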
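These identities can be checked numerically. The sketch below uses simulated vectors (an assumption made purely for illustration) and relies on the fact that the sample variance and covariance computed by var() and cov() obey the same identities, so both sides agree up to floating point error.

```r
# Simulated data, purely illustrative: any two numeric vectors of equal length work
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
a <- 2
b <- -3

# Var[aX + bY] = a^2 Var[X] + b^2 Var[Y] + 2ab Cov[X, Y]
lhs <- var(a * x + b * y)
rhs <- a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)
check_var <- isTRUE(all.equal(lhs, rhs))

# Cov[a + X, Y] = Cov[X, Y] and Cov[aX, Y] = a Cov[X, Y]
check_shift <- isTRUE(all.equal(cov(a + x, y), cov(x, y)))
check_scale <- isTRUE(all.equal(cov(a * x, y), a * cov(x, y)))

print(c(check_var, check_shift, check_scale))
```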

The other rules you need to know are the ones about iterated expectations and variances, namely that for random variables X and Y :

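\[ \mathrm{E}(X) = \mathrm{E}_Y\left(\mathrm{E}_X(X \mid Y)\right), \]

\[ \mathrm{Var}(X) = \mathrm{E}_Y\left(\mathrm{Var}_X(X \mid Y)\right) + \mathrm{Var}_Y\left(\mathrm{E}_X(X \mid Y)\right), \]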

where in each case the inner expectation or variance is with respect to the conditional distribution of X | Y , while the outer expectation or variance is with respect to the marginal distribution of Y.
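As a small worked illustration of these two rules, suppose (an assumption chosen only for convenience, not taken from the notes) that $Y \sim \mathrm{N}(\theta, \omega^2)$ and $X \mid Y \sim \mathrm{N}(Y, \sigma^2)$. The inner moments are $\mathrm{E}_X(X \mid Y) = Y$ and $\mathrm{Var}_X(X \mid Y) = \sigma^2$, so:

```latex
\begin{align*}
\mathrm{E}(X)   &= \mathrm{E}_Y\left(\mathrm{E}_X(X \mid Y)\right) = \mathrm{E}_Y(Y) = \theta, \\
\mathrm{Var}(X) &= \mathrm{E}_Y\left(\mathrm{Var}_X(X \mid Y)\right) + \mathrm{Var}_Y\left(\mathrm{E}_X(X \mid Y)\right)
                 = \mathrm{E}_Y(\sigma^2) + \mathrm{Var}_Y(Y) = \sigma^2 + \omega^2.
\end{align*}
```

The marginal variance of $X$ is therefore the conditional variance plus the variance of the conditional mean.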

Definition

A spatial stochastic process is a family of random variables

\[ \{ Z(\mathbf{s}) : \mathbf{s} \in D \}, \]

indexed by spatial locations $\mathbf{s} \in D$. Here:

  • $D$ is the spatial domain of the process, that is the geographical region in which observations could be made; and
  • $Z(\mathbf{s})$ is a random variable representing the quantity that you measure at location $\mathbf{s}$.

1 Three types of spatial data

There are three different types of spatial data, and the statistical models used differ for each type. The three types differ largely in their definition of the spatial domain $D$.

  1. Geostatistical processes: The spatial domain $D$ is a continuous 2-dimensional region and data could, in theory, be observed at infinitely many locations $\mathbf{s} \in D$. However, in practice observations are made at $n$ locations chosen by the person collecting the data.

  2. Areal processes: The spatial domain is a continuous 2-dimensional region which has been split into $n$ non-overlapping sub-regions $D = \{B_1, \dots, B_n\}$, where $B_i$ is the $i$th sub-region. Here, data are observed for each of the $n$ sub-regions.

  3. Point processes: The spatial domain $D$ is a random set in $\mathbb{R}^2$, and the locations of the observations are themselves the data. That is, the number and the locations of the data points are random rather than being specified by the person collecting the data.

Example

What types of spatial data are the following:

  1. The locations of trees in a forest.

  2. The percentage of people unemployed in each local authority in Scotland.

  3. The concentrations of air pollution in Glasgow.

Well we have that:

  1. Point process - the locations are random and not chosen.

  2. Areal process - Scotland is partitioned into a number of local authorities.

  3. Geostatistical process - concentrations could be measured at any location in Glasgow.

1 Geostatistical processes

Geostatistical processes are also known as point-referenced processes.

Definition

A geostatistical process is the stochastic process

\[ \{ Z(\mathbf{s}) : \mathbf{s} \in D \}, \]

where $D$ is a fixed subset of the $p$-dimensional space $\mathbb{R}^p$ (here we only consider $p = 2$). The locations $\mathbf{s}$ at which data could occur vary continuously over $D$. However, in practice data are observed at a finite number of locations, and are denoted by $\mathbf{Z} = (Z(\mathbf{s}_1), \dots, Z(\mathbf{s}_n))$.

Question

Why allow the stochastic process to vary continuously over a spatial domain when it is only observed at n locations?

A number of reasons, including: (i) although data are only observed at n locations, it is possible (at least in theory) to measure at infinitely many locations across the spatial domain D ; and (ii) predicting the process at unmeasured locations is a major goal of geostatistical analyses, and thus you need to define the stochastic process over the entire continuous domain.

Example

Data recording monthly mean temperature levels are available for June 2012, at n = 29 locations on the UK mainland. Possible questions of interest are:

  1. Produce a map showing the predicted temperature for the whole of the UK so the spatial pattern can be observed.

  2. Use the map to estimate the average temperature over the UK.

Given this partition of the study region, an areal unit process is the stochastic process

\[ \mathbf{Z} = (Z(B_1), \dots, Z(B_n)), \]

which is only defined on the $n$ areal units $\{B_1, \dots, B_n\}$ rather than at infinitely many possible locations. An alternative formulation that is sometimes used to represent the areal process is

\[ \mathbf{Z} = (Z(\mathbf{s}_1), \dots, Z(\mathbf{s}_n)), \]

where each $\mathbf{s}_i \in D$ is a location representative of the region $B_i$, such as its central point, known as its centroid.

Example

Data were collected for a study investigating the potential link between lip cancer and exposure to sunlight, and were collected for n = 56 districts in Scotland. These districts were defined in 1973 and are no longer in use. The data are as follows:

  • The response variable $\mathbf{Z} = (Z(B_1), \dots, Z(B_{56}))$ contains the observed numbers of lip cancer cases registered between 1975 and 1986 for the 56 districts in Scotland.

  • The population sizes and demographic structures (e.g. percentages of the population in different age and sex groups) are not constant across the 56 districts, so to accurately measure disease risk the expected numbers of cases per district were calculated using indirect standardisation. These expected numbers of cases are denoted by $\mathbf{E} = (E(B_1), \dots, E(B_{56}))$, and were computed using national age and sex specific disease rates for the whole of Scotland.

  • Based on these data a simple measure of disease risk is the standardized morbidity ratio (SMR), which is defined as the observed divided by the expected numbers of cases:

\[ \mathrm{SMR}(B_i) = \frac{Z(B_i)}{E(B_i)}. \]

  • Additionally, a proxy measure of population exposure to sunlight for each district was available, which was the percentage of the workforce in each district employed in agriculture, fishing and forestry.
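The SMR calculation is a simple element-wise ratio. A minimal sketch follows, using made-up observed and expected counts for three hypothetical districts (these are not the Scottish lip cancer data). An SMR above 1 means more cases were observed than expected, and an SMR below 1 means fewer.

```r
# Hypothetical counts for three districts (illustrative values, not the real data)
observed <- c(9, 39, 11)      # Z(B_i): observed numbers of cases
expected <- c(1.5, 9.0, 11.0) # E(B_i): expected numbers of cases

# SMR(B_i) = Z(B_i) / E(B_i)
smr <- observed / expected

# Flag districts with more cases than expected
elevated <- smr > 1
print(data.frame(observed, expected, smr, elevated))
```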

Question: What effect does exposure to sunlight have on disease risk?

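[Figure: the SMR by district (legend classes [0,2), [2,4), [4,6), [6,8]) and the percentage employed in agriculture, fishing and forestry by district (legend classes [−2,0), [0,4), [4,8), [8,13), [13,20), [20,26]).]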

1 Point processes

Definition

Consider a spatial domain $D$, and let $A \subset D$. Then let $Z(A)$ denote the number of points that occur in $A$. Then the set

\[ \mathbf{Z} = \{ Z(A) : A \subset D \}, \]

is a spatial point process. There are two types of spatial point process:

  • For an unmarked point process only the locations of the data are recorded.

  • For a marked point process the locations of the data are recorded, as are measurements about each location.
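The counting variable $Z(A)$ can be illustrated with a short sketch. Everything below is assumed for illustration: the domain is taken to be the unit square, the point locations are simulated, the sub-region $A$ is the lower-left quarter, and the mark column (`diameter`) is hypothetical.

```r
set.seed(42)

# Hypothetical unmarked point pattern: 100 locations in D = [0, 1] x [0, 1]
n_points <- 100
pts <- data.frame(x = runif(n_points), y = runif(n_points))

# Z(A): number of points falling in the sub-region A = [0, 0.5] x [0, 0.5]
in_A <- pts$x <= 0.5 & pts$y <= 0.5
Z_A <- sum(in_A)

# Counts over A and its complement in D add up to the total number of points
Z_rest <- sum(!in_A)
print(c(Z_A = Z_A, Z_rest = Z_rest, total = Z_A + Z_rest))

# A marked point pattern additionally records a measurement at each location
pts$diameter <- rlnorm(n_points)
```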

  • To find a statistical model that adequately explains the dependence observed in the data, to understand its spatial pattern.

- For the UK monthly mean temperatures, we could find a statistical model that accounts for the spatial trends (smooth changes in the mean), but also accounts for the fact that temperatures of locations nearby are more similar than those far apart (there is spatial dependence).

  • To predict the spatial process (along with a measure of uncertainty) at unmeasured locations, and create a prediction map of the process over the entire study region and not just at the data locations.

  • To use the predicted map to draw inference about the process being studied.

- For the UK monthly mean temperature data, was there evidence of a decrease in temperatures in the North of Scotland relative to the South of England?

  • To estimate the effects of a covariate measured at the same locations in space as the response.

Goals for areal unit data

  • To find a statistical model that adequately explains the trends and dependence observed in the data, to understand its spatial pattern.

- For the lip cancer data, one might be interested in determining the districts of Scotland at high risk of disease. This general area of spatial epidemiology is known as disease mapping.

  • To use the statistical model to estimate the effects of an exposure on a response, while accounting for the fact that the residuals are spatially autocorrelated.

- For the lip cancer data, one might be interested in estimating the effect of sunlight on lip cancer risk. This general area of spatial epidemiology is known as ecological regression.

Goals for point process data

  • To find a statistical model that adequately explains the spatial dependence structure in the data.

- For the biological cell locations example, we can estimate the intensity of the process in space. A region with a higher intensity will tend to have more points in it.

- We may also be interested in determining if cells are clustered together, keep themselves apart from other cells, or are spatially random.

2 Geostatistical processes

In this chapter we will describe how to model geostatistical data, and will illustrate how to implement the techniques using the R package geoR.

2 Introduction

Definition

A geostatistical process is the stochastic process

\[ \{ Z(\mathbf{s}) : \mathbf{s} \in D \}, \]

where $D$ is a fixed subset of the $p$-dimensional space $\mathbb{R}^p$. In this course we restrict attention to $p = 2$, so $D \subset \mathbb{R}^2$. The locations $\mathbf{s} = (s_1, s_2)$ at which data could occur vary continuously over $D$. However, in practice data are observed at a finite number of locations, and are represented by random variables denoted by $\mathbf{Z} = (Z(\mathbf{s}_1), \dots, Z(\mathbf{s}_n))$.

Notes

  • Even though data are only observed at $n$ spatial locations, we may (read will) want to predict the stochastic process at unknown locations to produce a map of the variable being modelled across the spatial domain $D$.

  • To produce such a map one predicts the spatial process $Z(\mathbf{s})$ at $N$ prediction locations $\mathbf{s}^* = \{\mathbf{s}^*_1, \dots, \mathbf{s}^*_N\}$, where typically $N \gg n$. These prediction locations typically form a regular grid, and hence when plotted form a map.

  • The key challenge when modelling spatial data compared with some types of non-spatial data is dependence / autocorrelation. Typically, geostatistical data will display positive autocorrelation, which means that the nearer in space two observations are, the more similar their values are likely to be.

  • This autocorrelation is caused by the variable of interest being affected by other unmeasured processes which are themselves spatially correlated. Ignoring spatial autocorrelation will lead to poor statistical inference (see later).

  • For example, air pollution concentrations are spatially correlated because they are caused by traffic, so if two pollution monitors are close to the same road they should have similar measurements.
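A regular grid of prediction locations of the kind described above can be built with base R. This is only a sketch: the bounding box and the grid resolution are hypothetical, and in practice they would be chosen to cover the study region $D$.

```r
# Hypothetical bounding box for the study region (illustrative values)
x_grid <- seq(0, 10, length.out = 50)
y_grid <- seq(0, 5, length.out = 25)

# Every combination of x and y gives one prediction location s*_j
pred_locs <- expand.grid(x = x_grid, y = y_grid)

# N, the number of prediction locations, is typically much larger than n
N <- nrow(pred_locs)
print(N)
head(pred_locs)
```

Predictions made at each row of `pred_locs` can then be plotted as a map.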

Aims

When analysing geostatistical data one could have many different aims, but the main questions one is trying to answer are:

  • What is the value of the process at unmeasured locations?

head(soil)

  easting northing
1    5340     5800
2    5590     5690
3    5990     5690
4    5990     5100
5    5780     4800
6    5590     4800

#### Format as a geodata object
library(geoR)
soil <- as.geodata(soil, coords.col = 1:2, data.col = 3, covar.col = 4:5, borders = TRUE)
soil$borders <- soil

## Plot the data
plot(soil, lowess = TRUE)

[Figure: output of the plot command, with four panels: Y Coord against X Coord (the data locations), Y Coord against data, data against X Coord, and the density of the data.]

This example has a boundary to the study region, which can be added to the geodata object soil in two stages, as shown above.

What are the prominent features of these data?

  • A clear spatial trend with highest values in the south and lowest in the north.

  • The data distribution is fairly Gaussian in shape.

2 Geostatistical theory

2.2 Means, variances and covariances

Definition

The mean function of the stochastic process $\{ Z(\mathbf{s}) : \mathbf{s} \in D \}$ is

\[ \mu(\mathbf{s}) = \mathrm{E}[Z(\mathbf{s})] \quad \forall \, \mathbf{s} \in D. \]

It is the theoretical mean/expectation at location $\mathbf{s}$, taken over the distribution of all possible values that could have been generated from the stochastic process $Z(\mathbf{s})$. When $Z(\mathbf{s})$ is a continuous random variable then

\[ \mu(\mathbf{s}) = \mathrm{E}[Z(\mathbf{s})] = \int_{-\infty}^{\infty} z f(z \mid \mathbf{s}) \, dz, \]

where $f(\cdot \mid \mathbf{s})$ is the probability density function (pdf) for $Z(\mathbf{s})$. In contrast, when $Z(\mathbf{s})$ is a discrete random variable with sample space $\mathcal{S}$ then

\[ \mu(\mathbf{s}) = \mathrm{E}[Z(\mathbf{s})] = \sum_{z_i \in \mathcal{S}} z_i f(z_i \mid \mathbf{s}), \]

where $f(\cdot \mid \mathbf{s})$ is the probability mass function (pmf) for $Z(\mathbf{s})$. However, in this course we will focus on models for continuous data.

Definition

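The autocovariance function of $\{ Z(\mathbf{s}) : \mathbf{s} \in D \}$ is defined as

\begin{align*}
C(\mathbf{s}, \mathbf{t}) &= \mathrm{Cov}[Z(\mathbf{s}), Z(\mathbf{t})] \\
&= \mathrm{E}[(Z(\mathbf{s}) - \mu(\mathbf{s}))(Z(\mathbf{t}) - \mu(\mathbf{t}))] \\
&= \mathrm{E}[Z(\mathbf{s}) Z(\mathbf{t})] - \mu(\mathbf{s}) \mu(\mathbf{t}).
\end{align*}

The autocovariance measures the strength of the linear dependence between $Z(\mathbf{s})$ and $Z(\mathbf{t})$. The variance function of $Z(\mathbf{s})$ is the special case of the autocovariance with $\mathbf{s} = \mathbf{t}$, which gives

\[ \nu^2(\mathbf{s}) = \mathrm{Var}[Z(\mathbf{s})] = C(\mathbf{s}, \mathbf{s}). \]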

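Definition

A function $C(\mathbf{s}, \mathbf{t})$ is nonnegative definite if

\[ \sum_{j=1}^{n} \sum_{k=1}^{n} a_j a_k C(\mathbf{s}_j, \mathbf{s}_k) \geq 0 \]

for all positive integers $n$, real-valued constants $(a_1, \dots, a_n)$ and spatial locations $\{\mathbf{s}_1, \dots, \mathbf{s}_n\}$.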

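Theorem

The autocovariance function $C(\mathbf{s}, \mathbf{t})$ is a nonnegative definite function.

Proof

Consider the following weighted sum of the geostatistical process $\{ Z(\mathbf{s}) : \mathbf{s} \in D \}$ measured at $n \geq 1$ locations $\{\mathbf{s}_1, \dots, \mathbf{s}_n\}$:

\[ Y = \sum_{j=1}^{n} a_j Z(\mathbf{s}_j), \]

where $(a_1, \dots, a_n)$ are real constants. Then the variance of $Y$ is

\begin{align*}
\mathrm{Var}[Y] &= \mathrm{Cov}[Y, Y] \\
&= \mathrm{Cov}\left[ \sum_{j=1}^{n} a_j Z(\mathbf{s}_j), \sum_{k=1}^{n} a_k Z(\mathbf{s}_k) \right] \\
&= \sum_{j=1}^{n} \sum_{k=1}^{n} a_j a_k \mathrm{Cov}(Z(\mathbf{s}_j), Z(\mathbf{s}_k)) \\
&= \sum_{j=1}^{n} \sum_{k=1}^{n} a_j a_k C(\mathbf{s}_j, \mathbf{s}_k) \\
&\geq 0.
\end{align*}

The last line holds because variances cannot be negative.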
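The theorem can be illustrated numerically. The sketch below rests on assumptions that are not part of the notes so far: the locations are simulated, and the exponential form $C(\mathbf{s}, \mathbf{t}) = \sigma^2 \exp(-\|\mathbf{s} - \mathbf{t}\| / \phi)$, with arbitrary values $\sigma^2 = 1.5$ and $\phi = 0.3$, is used as one example of a valid autocovariance function. The quadratic form $\sum_j \sum_k a_j a_k C(\mathbf{s}_j, \mathbf{s}_k)$ is then evaluated for random constants.

```r
set.seed(1)

# Hypothetical locations s_1, ..., s_n in the unit square
n <- 20
coords <- cbind(runif(n), runif(n))

# Assumed exponential autocovariance, evaluated at all pairs of locations
sigma2 <- 1.5
phi <- 0.3
dist_mat <- as.matrix(dist(coords))
Sigma <- sigma2 * exp(-dist_mat / phi)

# Quadratic form sum_j sum_k a_j a_k C(s_j, s_k) for arbitrary real constants a
a <- rnorm(n)
quad_form <- as.numeric(t(a) %*% Sigma %*% a)

# Equivalent check: no eigenvalue of the covariance matrix is negative
min_eig <- min(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values)

print(c(quad_form = quad_form, min_eig = min_eig))
```

Both quantities should be nonnegative, up to floating point error, whatever constants `a` are drawn.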

Definition

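The autocorrelation function of $\{ Z(\mathbf{s}) : \mathbf{s} \in D \}$ is given by

\[ \rho(\mathbf{s}, \mathbf{t}) = \mathrm{Corr}[Z(\mathbf{s}), Z(\mathbf{t})] = \frac{C(\mathbf{s}, \mathbf{t})}{\sqrt{C(\mathbf{s}, \mathbf{s}) \, C(\mathbf{t}, \mathbf{t})}}. \]

The autocorrelation function measures the strength of the linear association between $Z(\mathbf{s})$ and $Z(\mathbf{t})$, and is simply a scaled version of the autocovariance function. The autocorrelation function has the following properties:

  1. $\rho(\mathbf{t}, \mathbf{t}) = 1$ for each $\mathbf{t} \in D$.

  2. $\rho(\mathbf{s}, \mathbf{t}) = \rho(\mathbf{t}, \mathbf{s})$ for each $\mathbf{s}, \mathbf{t} \in D$.

  3. $\rho(\mathbf{s}, \mathbf{t})$ is a nonnegative definite function.

  4. $-1 \leq \rho(\mathbf{s}, \mathbf{t}) \leq 1$ for all pairs $\mathbf{s}, \mathbf{t} \in D$.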

Example

The simplest example of a spatial process is the white noise process $Z(\mathbf{s})$ that satisfies:

  1. $\mathrm{E}[Z(\mathbf{s})] = \mu$ for all $\mathbf{s} \in D$,

  2. \[ C(\mathbf{s}, \mathbf{t}) = \begin{cases} \tau^2 & \text{if } \mathbf{s} = \mathbf{t}, \\ 0 & \text{otherwise,} \end{cases} \]

which assumes that all observations are independent, i.e. have no spatial autocorrelation. This process is entirely useless for predicting the spatial process at unmeasured locations, because it is the spatial autocorrelation in the process that is used to predict at unmeasured locations. That is, under spatial autocorrelation if you predict the process at a location close to an existing measured data point, you would expect a similar value to that observed data point. Under independence this is not true.
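To make the white noise definition concrete, its covariance structure at $n$ distinct locations can be written out. The matrix notation $\boldsymbol{\Sigma}$ and $\mathbf{I}_n$ (the $n \times n$ identity matrix) is introduced here only for illustration.

```latex
\begin{align*}
\boldsymbol{\Sigma}_{jk} &= C(\mathbf{s}_j, \mathbf{s}_k) =
  \begin{cases} \tau^2 & \text{if } j = k, \\ 0 & \text{if } j \neq k, \end{cases}
  \qquad \text{so} \qquad \boldsymbol{\Sigma} = \tau^2 \mathbf{I}_n, \\
\rho(\mathbf{s}_j, \mathbf{s}_k) &= \frac{C(\mathbf{s}_j, \mathbf{s}_k)}{\sqrt{C(\mathbf{s}_j, \mathbf{s}_j) \, C(\mathbf{s}_k, \mathbf{s}_k)}}
  = \frac{0}{\tau^2} = 0 \qquad \text{for } j \neq k.
\end{align*}
```

Every off-diagonal entry is zero, so an observation at one location carries no linear information about the process at any other location.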

2.2 Stationarity and isotropy

Essentially, stationarity means that the geostatistical process $\{ Z(\mathbf{s}) : \mathbf{s} \in D \}$ has the same characteristics at any location in space, such as a constant mean, constant variance, etc. It does not mean that the data that arise from the process are all the same. There are two types of stationarity condition, strict stationarity and weak stationarity.

Definition

A geostatistical process $\{ Z(\mathbf{s}) : \mathbf{s} \in D \}$ is strictly stationary if

\[ f(Z(\mathbf{s}_1), \dots, Z(\mathbf{s}_n)) = f(Z(\mathbf{s}_1 + \mathbf{h}), \dots, Z(\mathbf{s}_n + \mathbf{h})), \]

for any displacement vector $\mathbf{h}$ and any set of $n$ locations $\{\mathbf{s}_1, \dots, \mathbf{s}_n\}$. Essentially this means that the joint distribution of a set of random variables is unaffected by spatial shifts. If a process is strictly stationary then this means that:

  1. $Z(\mathbf{s})$ has the same distribution for all $\mathbf{s} \in D$, as $f(Z(\mathbf{s})) = f(Z(\mathbf{s} + \mathbf{h}))$, which in turn means $\mu(\mathbf{s}) = \mu$ (the mean is constant) and $\nu^2(\mathbf{s}) = \nu^2$ (the variance is constant).

  2. The bivariate distribution also does not depend on spatial location, that is $f(Z(\mathbf{s}), Z(\mathbf{s} + \mathbf{h})) = f(Z(\mathbf{0}), Z(\mathbf{h}))$ for all $\mathbf{s}$ and $\mathbf{h}$. This in turn means that

\[ \mathrm{Cov}[Z(\mathbf{s}), Z(\mathbf{s} + \mathbf{h})] = C(\mathbf{s}, \mathbf{s} + \mathbf{h}) = C_Z(\mathbf{h}). \]
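The covariance consequence of stationarity can be sketched numerically: if the covariance between two locations depends only on their separation, then shifting every location by the same displacement leaves the covariance matrix unchanged. The locations, the displacement and the exponential covariance form below are all assumptions made for illustration.

```r
# Three hypothetical locations (one per row) and a displacement vector h
coords <- cbind(c(0, 1, 3), c(0, 2, 1))
h <- c(5, -2)

# Shift every location by h: adds h[1] to column 1 and h[2] to column 2
shifted <- sweep(coords, 2, h, "+")

# Assumed covariance that depends on the locations only through their separation
cov_fun <- function(xy) 2 * exp(-as.matrix(dist(xy)) / 1.5)

Sigma_original <- cov_fun(coords)
Sigma_shifted <- cov_fun(shifted)

# The two covariance matrices agree up to floating point error
unchanged <- isTRUE(all.equal(Sigma_original, Sigma_shifted))
print(unchanged)
```

A covariance that changed under this shift could not be written as a function of the displacement alone.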
