Moran’s I

Spatial Autocorrelation

The data

For this exercise, we will use the Pennsylvania county-level lung cancer counts for 2002. The data are provided by the SpatialEpi package, which we will install for this exercise. The dataset comes a tibble with a the following components:

  • geo – A table of county IDs, with longitudes and latitudes of the geographical centroid of each county
  • spatial.polygon – A SpatialPolygons object giving the boundaries of each county in latitude and longitude (geographical coordinates)
  • data – A table of county IDs, number of cases, population subdivided by race, gender and age
  • smoking – A table of county IDs and the percentage of smokers
library(tmap)
library(spdep)
library(tmaptools)
library(SpatialEpi)
library(sf)

# Read in the Pennsylvania lung cancer data
data(pennLC)

# Extract the SpatialPolygon info
penn.state.latlong <- st_as_sf(pennLC$spatial.polygon)

# Extract the Smoking info
penn.state.smoking <- pennLC$smoking

# Combine the two data objects: spatial polygons and smoking 
penn.state <- cbind(penn.state.smoking, penn.state.latlong)

# the cbind() function abouve creates a dataframe. Change that to an simple feature
penn.state <- st_as_sf(penn.state)

# Convert to UTM zone 17N
# Why do we need to convert to UTM ?
penn.state.utm <- st_transform(penn.state, 3724)
if ("sf" %in% class(penn.state.utm)) 
  penn.state.utm <- as(penn.state.utm,"Spatial")


# create a new variable and assign it the smoking rates as actual percentages (not decimals)
penn.state.utm$smk <- penn.state.utm$smoking * 100

#  Prepare a choropleth map of the smoking rates
tm_shape(penn.state.utm) + tm_polygons(col='smk',title='% of Popn.')

Do you see a degree of spatial autocorrelation from the choropleth map created above? If so, how would you describe it?

Visual Exploration

We can use a visual approach to test the claim that the underlying data values used in the choropleth above show a degree of spatial autocorrelation. We do this by comparing the pattern seen in the map to a set of random patterns, where the observed smoking rates are assigned randomly to counties.

# Set up a set of five 'fake' smoking update rates as well as the real one
# Create new columns in penn.state.utm for randomised data
# Here the seed 8125 (for the lab room 8-125) is used.  You can use a different seed to get a different outcome but having the same seed allows everyone to get the same results.

set.seed(8125)
penn.state.utm$smk_rand1 <- sample(penn.state.utm$smk)
penn.state.utm$smk_rand2 <- sample(penn.state.utm$smk)
penn.state.utm$smk_rand3 <- sample(penn.state.utm$smk)
penn.state.utm$smk_rand4 <- sample(penn.state.utm$smk)
penn.state.utm$smk_rand5 <- sample(penn.state.utm$smk)

# Scramble the variables used in terms of plotting order
vars <- sample(c('smk','smk_rand1','smk_rand2','smk_rand3','smk_rand4','smk_rand5'))


# Draw the scrambled map grid
tm_shape(penn.state.utm) + tm_polygons(col=vars,legend.show=FALSE) +
  tm_layout(title=1:6,title.position=c("right","top"))

From the choropleth grid above, which is the real dataset? Do you see a tendency for spatial autocorrelation in the randomly generated patterns?

Neighbours and Lagged Mean Plots

An alternative to the visual approach is to compare the value in each county with the average value for its neighbours. This can be achieved via the lag.listw function in the spdep library. A lagged mean plot can be generated if we have a list of neighbours for each county. Neighbours can be defined in several ways, such as the queen’s case definition or the more restrictive rook’s case.

#for the queen's case neighborhood. If Rook's case is required add the optional argument queen = FALSE
penn.state.nb <- poly2nb(penn.state.utm)
penn.state.nb

As seen above, the nb object lists various characteristics, but it is also possible to plot it.

#create the queen's case neighborhood
penn.state.nb.q <- poly2nb(penn.state.utm)

# Create a SpatialLinesDataFrame showing the Queen's case contiguities
penn.state.network.q <- nb2lines(penn.state.nb.q,coords=coordinates(penn.state.utm))

# Default projection is NA, but remember that is based on the UTM coordinates as seen above.
# You can change this as below

#change to simple feature
penn.state.network.q <- st_as_sf(penn.state.network.q)

#Assign Projection
st_crs (penn.state.network.q) <- 3724

# Draw the projections
tm_shape(penn.state.utm) + tm_borders(col='lightgrey') + 
  tm_shape(penn.state.network.q) + tm_lines(col='red')


#You can also determine the network based on 

# Calculate the Rook's case neighbours
penn.state.nb.r <- poly2nb(penn.state.utm,queen=FALSE)
# Convert this to a SpatialLinesDataFrame
penn.state.network.r <- nb2lines(penn.state.nb.r,coords=coordinates(penn.state.utm))

# Change to simple features and assign projection
penn.state.network.r <- st_as_sf(penn.state.network.r)
st_crs(penn.state.network.r) <-  3724

# Plot the counties in background,  then the two networks to compare: 
tm_shape(penn.state.utm) + tm_borders(col='lightgrey') + 
  tm_shape(penn.state.network.q) + tm_lines(col='red',lwd = 2) +
  tm_shape(penn.state.network.r) + tm_lines(col='blue', lwd = 2) 

The lagged mean plot is a plot of the value of z, for each polygon i against the mean of the z values for the neighbours of polygon i. First we will use the listw object to store the list of neighbours and their weights.

# Convert the neighbour list to a listw object using the Rook's case...
penn.state.lw.r <- nb2listw(penn.state.nb.r)
penn.state.lw.r

#use the ?nb2listw to see more about the weight styles 

penn.state.utm$smk.lagged.means.r <- lag.listw(penn.state.lw.r,penn.state.utm$smk)

#create a choropleth of the lagged means
tm_shape(penn.state.utm) + tm_polygons(col='smk.lagged.means.r',title='% of Popn.')  +
  tm_layout(legend.bg.color = "white") 

#Finally create a scatter plot of the lagged means against the smoking values 
#here the line y=x is added as reference
with(data.frame(penn.state.utm), {
  plot(smk,smk.lagged.means.r,asp=1,xlim=range(smk),ylim=range(smk))
  abline(a=0,b=1) # recall y = a + bx; a - intercept on the y-axis & b - slope
  abline(v=mean(smk),lty=2)
  abline(h=mean(smk.lagged.means.r),lty=2)
})
Moran’s I

Note that the plot above is also known as the Moran’s Plot.

#create the Moran's plot
moran.plot(penn.state.utm$smk,penn.state.lw.r)

#calculate the Moran's Statistic
moran.test(penn.state.utm$smk,penn.state.lw.r)

Assignment

You are provided with the Ohio dataset in the Lab 5 folder. The assignment is for you to determine whether there is spatial autocorrelation with respect to the senior population (SENIORPCT) in the state.

  • Assume that the data are provided in WGS 84 projection (4326) and that UTM Zone 17N (EPSG:3747) is adequate for the analysis
  • Use the visual exploration to see if there is spatial autocorrelation
  • Use the lagged mean plot to see if there is spatial autocorrelation
  • Finally, calculate Moran’s statistic, which will hopefully confirm your analysis.

Submit your results in a single PDF document. Show your images and accompanying analysis and explanations.