Moran’s I
Spatial Autocorrelation
The data
For this exercise, we will use the Pennsylvania county-level lung cancer counts for 2002. The data are provided by the SpatialEpi package, which we will install for this exercise. The dataset comes a tibble with a the following components:
- geo – A table of county IDs, with longitudes and latitudes of the geographical centroid of each county
- spatial.polygon – A SpatialPolygons object giving the boundaries of each county in latitude and longitude (geographical coordinates)
- data – A table of county IDs, number of cases, population subdivided by race, gender and age
- smoking – A table of county IDs and the percentage of smokers
library(tmap) library(spdep) library(tmaptools) library(SpatialEpi) library(sf) # Read in the Pennsylvania lung cancer data data(pennLC) # Extract the SpatialPolygon info penn.state.latlong <- st_as_sf(pennLC$spatial.polygon) # Extract the Smoking info penn.state.smoking <- pennLC$smoking # Combine the two data objects: spatial polygons and smoking penn.state <- cbind(penn.state.smoking, penn.state.latlong) # the cbind() function abouve creates a dataframe. Change that to an simple feature penn.state <- st_as_sf(penn.state) # Convert to UTM zone 17N # Why do we need to convert to UTM ? penn.state.utm <- st_transform(penn.state, 3724) if ("sf" %in% class(penn.state.utm)) penn.state.utm <- as(penn.state.utm,"Spatial") # create a new variable and assign it the smoking rates as actual percentages (not decimals) penn.state.utm$smk <- penn.state.utm$smoking * 100 # Prepare a choropleth map of the smoking rates tm_shape(penn.state.utm) + tm_polygons(col='smk',title='% of Popn.')
Do you see a degree of spatial autocorrelation from the choropleth map created above? If so, how would you describe it?
Visual Exploration
We can use a visual approach to test the claim that the underlying data values used in the choropleth above show a degree of spatial autocorrelation. We do this by comparing the pattern seen in the map to a set of random patterns, where the observed smoking rates are assigned randomly to counties.
# Set up a set of five 'fake' smoking update rates as well as the real one # Create new columns in penn.state.utm for randomised data # Here the seed 8125 (for the lab room 8-125) is used. You can use a different seed to get a different outcome but having the same seed allows everyone to get the same results. set.seed(8125) penn.state.utm$smk_rand1 <- sample(penn.state.utm$smk) penn.state.utm$smk_rand2 <- sample(penn.state.utm$smk) penn.state.utm$smk_rand3 <- sample(penn.state.utm$smk) penn.state.utm$smk_rand4 <- sample(penn.state.utm$smk) penn.state.utm$smk_rand5 <- sample(penn.state.utm$smk) # Scramble the variables used in terms of plotting order vars <- sample(c('smk','smk_rand1','smk_rand2','smk_rand3','smk_rand4','smk_rand5')) # Draw the scrambled map grid tm_shape(penn.state.utm) + tm_polygons(col=vars,legend.show=FALSE) + tm_layout(title=1:6,title.position=c("right","top"))
From the choropleth grid above, which is the real dataset? Do you see a tendency for spatial autocorrelation in the randomly generated patterns?
Neighbours and Lagged Mean Plots
An alternative to the visual approach is to compare the value in each county with the average value for its neighbours. This can be achieved via the lag.listw function in the spdep library. A lagged mean plot can be generated if we have a list of neighbours for each county. Neighbours can be defined in several ways, such as the queen’s case definition or the more restrictive rook’s case.
#for the queen's case neighborhood. If Rook's case is required add the optional argument queen = FALSE penn.state.nb <- poly2nb(penn.state.utm) penn.state.nb
As seen above, the nb object lists various characteristics, but it is also possible to plot it.
#create the queen's case neighborhood penn.state.nb.q <- poly2nb(penn.state.utm) # Create a SpatialLinesDataFrame showing the Queen's case contiguities penn.state.network.q <- nb2lines(penn.state.nb.q,coords=coordinates(penn.state.utm)) # Default projection is NA, but remember that is based on the UTM coordinates as seen above. # You can change this as below #change to simple feature penn.state.network.q <- st_as_sf(penn.state.network.q) #Assign Projection st_crs (penn.state.network.q) <- 3724 # Draw the projections tm_shape(penn.state.utm) + tm_borders(col='lightgrey') + tm_shape(penn.state.network.q) + tm_lines(col='red') #You can also determine the network based on # Calculate the Rook's case neighbours penn.state.nb.r <- poly2nb(penn.state.utm,queen=FALSE) # Convert this to a SpatialLinesDataFrame penn.state.network.r <- nb2lines(penn.state.nb.r,coords=coordinates(penn.state.utm)) # Change to simple features and assign projection penn.state.network.r <- st_as_sf(penn.state.network.r) st_crs(penn.state.network.r) <- 3724 # Plot the counties in background, then the two networks to compare: tm_shape(penn.state.utm) + tm_borders(col='lightgrey') + tm_shape(penn.state.network.q) + tm_lines(col='red',lwd = 2) + tm_shape(penn.state.network.r) + tm_lines(col='blue', lwd = 2)
The lagged mean plot is a plot of the value of z, for each polygon i against the mean of the z values for the neighbours of polygon i. First we will use the listw object to store the list of neighbours and their weights.
# Convert the neighbour list to a listw object using the Rook's case... penn.state.lw.r <- nb2listw(penn.state.nb.r) penn.state.lw.r #use the ?nb2listw to see more about the weight styles penn.state.utm$smk.lagged.means.r <- lag.listw(penn.state.lw.r,penn.state.utm$smk) #create a choropleth of the lagged means tm_shape(penn.state.utm) + tm_polygons(col='smk.lagged.means.r',title='% of Popn.') + tm_layout(legend.bg.color = "white") #Finally create a scatter plot of the lagged means against the smoking values #here the line y=x is added as reference with(data.frame(penn.state.utm), { plot(smk,smk.lagged.means.r,asp=1,xlim=range(smk),ylim=range(smk)) abline(a=0,b=1) # recall y = a + bx; a - intercept on the y-axis & b - slope abline(v=mean(smk),lty=2) abline(h=mean(smk.lagged.means.r),lty=2) })
Moran’s I
Note that the plot above is also known as the Moran’s Plot.
#create the Moran's plot moran.plot(penn.state.utm$smk,penn.state.lw.r) #calculate the Moran's Statistic moran.test(penn.state.utm$smk,penn.state.lw.r)
Assignment
You are provided with the Ohio dataset in the Lab 5 folder. The assignment is for you to determine whether there is spatial autocorrelation with respect to the senior population (SENIORPCT) in the state.
- Assume that the data are provided in WGS 84 projection (4326) and that UTM Zone 17N (EPSG:3747) is adequate for the analysis
- Use the visual exploration to see if there is spatial autocorrelation
- Use the lagged mean plot to see if there is spatial autocorrelation
- Finally, calculate Moran’s statistic, which will hopefully confirm your analysis.
Submit your results in a single PDF document. Show your images and accompanying analysis and explanations.