In today’s lab, we will be looking at two fundamental exploratory analysis tools: cluster analysis (using K-Means) and quadrat analysis.
K-Means Clustering
Clustering algorithms are useful for separating data into groups with similar qualities. Clustering is often not the end goal in and of itself, but it provides a means to start looking for patterns: once groups have been formed, it is possible to take a deeper look at what is driving those clusters, allowing us to find patterns in the attributes.
Below is a demo of the K-Means algorithm created using Shiny, much like the maps we made in Lab 5. This demo allows you to step through the iterations to see the impact of each one.
https://connect.gis.unbc.ca/content/8453154e-321e-48eb-9304-7375277ae188
In this example, we are working in 2D space (that is, with two attributes), and distance is calculated as the distance between each point p and each cluster centre k. This is relatively easy to understand and provides a helpful visual.
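For reference, here is a minimal sketch of the assignment step the demo is animating. The points and centres are made-up toy values, and kmeans() handles all of this internally; this is just to make the distance calculation concrete.

# Toy data: three points and two cluster centres in 2D (two attributes)
pts <- matrix(c(1, 2, 5, 6, 1.5, 1.8), ncol = 2, byrow = TRUE)
centres <- matrix(c(1, 2, 5, 5), ncol = 2, byrow = TRUE)

# Euclidean distance from every point p to every centre k
dists <- apply(centres, 1, function(k) sqrt(rowSums(sweep(pts, 2, k)^2)))

# Each point joins its nearest centre; K-Means then recomputes each centre as
# the mean of its members and repeats both steps until nothing changes
assignment <- apply(dists, 1, which.min)
assignment  # 1 2 1: points 1 and 3 join centre 1, point 2 joins centre 2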
First, let’s look at the Guerry dataset documentation file to make sure we understand the data.
library(tidyverse)
library(sf)
library(rgdal)
library(leaflet)

guerry_shp <- st_read("guerry/guerry.shp")

ggplot() +
  geom_sf(data = guerry_shp, aes(geometry = geometry), color = 'blue')

# To perform K-Means analysis we need to convert the data from the simple
# feature format to a dataframe
head(guerry_shp)
guerry_df <- as.data.frame(guerry_shp)
head(guerry_df)

# Compute K-Means
num_clusters <- 5
guerry_ke <- kmeans(guerry_df[, c('Crm_prs', 'Crm_prp', 'Litercy',
                                  'Donatns', 'Infants', 'Suicids')],
                    num_clusters)

# Bind cluster to input data !!! Do Not Sort Clusters !!!
guerry_clusters <- cbind(guerry_shp, guerry_ke$cluster)

# Convert to sf for leaflet
guerry_sf <- st_as_sf(guerry_clusters)

# Rename cluster column
colnames(guerry_sf)[ncol(guerry_shp)] <- "cluster"

# Make colour palette
cols = rainbow(num_clusters)

# Show Leaflet map
leaflet(guerry_sf %>% st_transform(4326)) %>%         # Leaflet does not understand UTM!
  addProviderTiles("CartoDB.Positron") %>%            # Add basemap
  addPolygons(color = cols[guerry_sf$cluster],        # Index the palette by cluster so
              fillOpacity = 1)                        # each cluster gets one colour
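Before mapping, it is worth poking around inside the fitted kmeans object to see what separates the clusters. A short sketch, using the guerry_ke object from above:

# Cluster sizes: how many departments fell into each cluster
guerry_ke$size

# Cluster centres: the mean of each attribute within each cluster; comparing
# rows shows which attributes distinguish the clusters
round(guerry_ke$centers, 1)

# Within-cluster sum of squares: lower values mean tighter clusters
guerry_ke$withinss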
One more tip: as the clusters start in random locations, the output may vary slightly based on the starting state. If you need consistent output for testing purposes, you can seed the random number generator before calling kmeans(), e.g. set.seed(123).
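For example, seeding before each run makes the clustering reproducible. A quick sketch using a few of the Guerry columns from above:

# Same seed before each call gives identical cluster assignments
set.seed(123)
run_1 <- kmeans(guerry_df[, c('Crm_prs', 'Crm_prp', 'Litercy')], num_clusters)

set.seed(123)
run_2 <- kmeans(guerry_df[, c('Crm_prs', 'Crm_prp', 'Litercy')], num_clusters)

all(run_1$cluster == run_2$cluster)  # TRUE: the runs match exactly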
Quadrat Analysis
Quadrat analysis allows for exploratory searching for patterns in the locations of points. In today’s lab we will be looking at the locations of crashes in the Lower Mainland, using the spatstat package. *Note that spatstat does not natively work with sf or ggplot, so our code will look a little different.
As always, we start by loading some data. This is a three-step process: first we open the data, then convert the coordinates into point locations, and finally convert the sf data into a spatial point pattern (ppp).
library(spatstat) # provides ppp objects, quadratcount(), and quadrat.test()

crash_data <- read.csv2("Lower_mainland_Crashes_expanded.csv") %>%
  select(Latitude, Longitude, Location)
# Note the use of read.csv2 vs the read_csv function we have used previously

# Convert the table to sf objects; note the data are in the WGS 84 CRS (map projection)
crash_locations <- st_as_sf(crash_data,
                            coords = c("Longitude", "Latitude"),
                            crs = 4326) %>%
  st_transform(32610) # Reproject to WGS 84 / UTM Zone 10N (EPSG:32610).
                      # Never calculate statistics on unprojected data!

# spatstat requires the data as a ppp (planar point pattern)
crash_points <- as.ppp(crash_locations)

# Minimize plot margins to help see points; set default point colour to red.
# Strange bug where col in plot() changes the background not the point colour,
# so it was moved to the par() function
par(plt = c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5))

# Plot points
plot(crash_points, pch = 20, cex = 0.5, main = 'q-plot',
     legend = F, label_value = F)

# Count quadrats
Q <- quadratcount(crash_points, nx = 5, ny = 5)

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(Q, main = 'q-plot', col = alpha('black', 1), add = T)
Once all the points have been counted, the next step is the chi-squared test, done with the quadrat.test() function.
# Chi-squared test
qtest <- quadrat.test(Q)

# Print results to console
qtest

# Optionally plot quadrats (count, expected count, deviation)
plot(qtest, cex = 0.7, col = 'black')
Let’s talk about the output.
# Chi-squared test of CSR using quadrat counts
# data:
# X2 = 140895, df = 24, p-value < 2.2e-16
# alternative hypothesis: two.sided
# Quadrats: 5 by 5 grid of tiles
The interpretation: the chi-squared statistic (X2 = 140895) is far larger than the critical value of a chi-squared distribution with 24 degrees of freedom (about 36.4 at the 0.05 level), meaning the counts vary between quadrats much more than complete spatial randomness would allow; in other words, these data are clustered. As the p-value is less than 0.05, we can take this result to be statistically significant.
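If you want to check these numbers yourself, the critical value comes from base R’s qchisq(), and the test object stores its components. A quick sketch:

# Critical value at alpha = 0.05 for 24 degrees of freedom
qchisq(0.95, df = 24)  # ~36.42, dwarfed by X2 = 140895

# quadrat.test() returns a standard htest object, so its pieces can be
# pulled out directly
qtest$statistic  # X2
qtest$parameter  # degrees of freedom
qtest$p.value    # p-value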
Data Prep is Important
You may have noticed the data we loaded was called Lower_mainland_Crashes_expanded. This expanded dataset contains one extra point for each extra crash at a location. In the original data, by contrast, those extra crashes are recorded in a column as a count of crashes at that location. If you repeat the analysis above with the original data, how do your results differ?
crash_data <- read_csv("Lower_mainland_Crashes_data.csv")
# How does the chi-squared value for this dataset differ from the expanded
# one? What does this tell us?
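A sketch of the remaining steps, which mirror the expanded-data code above (this assumes the original file uses the same Latitude/Longitude columns; check your CSV):

# Convert to sf points and reproject, exactly as before
crash_locations <- st_as_sf(crash_data, coords = c("Longitude", "Latitude"),
                            crs = 4326) %>%
  st_transform(32610)
crash_points <- as.ppp(crash_locations)

# Recount and retest on the same 5 x 5 grid
Q_orig <- quadratcount(crash_points, nx = 5, ny = 5)
quadrat.test(Q_orig)  # compare this X2 with the expanded dataset's 140895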
Looking Through the Window
Another interesting point about these data is that some of the quadrats contain a lot of ocean, and cars generally do not crash in the ocean. We can try to account for this by applying a window to the quadrat analysis.
# Open the lower_mainland shapefile to use as a window
outline <- st_read(dsn = "lower_mainland", layer = 'lower_mainland')

crash_data <- read_csv("Lower_mainland_Crashes_data.csv") %>%
  select(Latitude, Longitude, Location)

# Convert the spreadsheet to sf objects; note the data are in the WGS 84 CRS
crash_locations <- st_as_sf(crash_data,
                            coords = c("Longitude", "Latitude"),
                            crs = 4326) %>%
  st_transform(32610) # Reproject to UTM Zone 10N

# spatstat requires the data as a ppp
crash_points <- as.ppp(crash_locations)

# Set window on crash points
window <- as.owin(outline)
crash_points_window <- crash_points[window]

# Count quadrats
Q_window <- quadratcount(crash_points_window, nx = 5, ny = 5)

# Minimize plot margins to help see points; set default point colour to red
par(plt = c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5))

# Plot points
plot(crash_points_window, pch = 20, cex = 0.5, main = 'q_window-plot',
     legend = F, label_value = F)

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(Q_window, main = 'q_window-plot', col = alpha('black', 1), add = T)

# Chi-squared test
q_window_test <- quadrat.test(Q_window)
q_window_test
plot(q_window_test, cex = 0.7, col = 'black')
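With both tests in hand, you can put the statistics side by side. A sketch, assuming qtest from the earlier unwindowed run is still in your environment (note it was computed on the expanded file, so rerun it on the same input data if you want a strict comparison):

# Compare the chi-squared statistics of the unwindowed and windowed tests
c(unwindowed = unname(qtest$statistic),
  windowed   = unname(q_window_test$statistic))

# The window also changes the test itself: tiles clipped by the coastline
# have smaller areas (and ocean-only tiles are dropped entirely), so the
# expected counts and the degrees of freedom both shift
q_window_test$parameter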
More on Quadrats (Windows) Using Synthetic Data
synth_data_points <- st_read(dsn = "synthetic_dataset", layer = 'random_points')

# spatstat requires the data as a ppp, but you will notice the points are
# already in UTM Zone 10, so there is no need for a coordinate transformation
synth_data_points <- as.ppp(synth_data_points)

# Define the number of quadrats (grid size for the x and y axes)
wx <- 8
wy <- 8
Q_synth_window <- quadratcount(synth_data_points, nx = wx, ny = wy)

# Minimize plot margins to help see points; set default point colour to red
par(plt = c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5))

# Plot points
plot(synth_data_points, pch = 20, cex = 0.5, legend = F, label_value = F)

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(Q_synth_window, col = alpha('black', 1), add = T)

q_synth_test <- quadrat.test(Q_synth_window)
q_synth_test # See the results of your q_synth_test

## LET'S EXPLORE
# The expected frequency is the total number of cases (counts) divided by the
# number of quadrats
ef <- synth_data_points$n / (wx * wy)

# We need to weigh the quadrat frequencies using the square of the deviations
# divided by the expected frequency. If the sum is small there is no need to
# reject the null hypothesis; if it is big, we reject the null hypothesis.
qw <- as.data.frame(Q_synth_window)
dS <- sum(((qw$Freq - ef)^2) / ef)
dS

# Repeat the above with different grid sizes (8 by 8, 7 by 7, ..., 2 by 2)
# and see the results
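Rather than editing wx and wy by hand for every grid size, the last step can be wrapped in a loop. A small sketch:

# Compute the chi-squared statistic for square grids from 8x8 down to 2x2
for (k in 8:2) {
  Q_k  <- quadratcount(synth_data_points, nx = k, ny = k)
  ef_k <- synth_data_points$n / (k * k)   # expected frequency per quadrat
  qw_k <- as.data.frame(Q_k)
  dS_k <- sum((qw_k$Freq - ef_k)^2 / ef_k)
  cat(sprintf("%d x %d grid: dS = %.2f\n", k, k, dS_k))
}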
Assignment
Answer the three questions below by adding the missing lines to the code.
Submit your completed R script and screenshots of the plots produced (or right-click on a graph and choose “Save as” to download it to your computer). Make sure to caption the plots in the PDF document.
(There is more on colour palettes in the RColorBrewer documentation.)
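If you want to browse the palettes before committing to one, RColorBrewer can draw them all. For example:

library(RColorBrewer)

# Preview every available palette in the plot pane; qualitative palettes
# (such as the "Set2" used below) suit unordered categories like cluster IDs
display.brewer.all()

# Pull five colours from one palette
brewer.pal(5, "Set2")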
library(tidyverse)
library(sf)
library(sp)
library(leaflet)
library(spatstat)
library(dplyr)
library(RColorBrewer)

################################################
### Question 1: Which city in the layer 5CCITIES
### stands out distinctly based upon the K-Means analysis of marital status?
### NEVERMARRY, MARRIED, SEPARATED, WIDOWED, DIVORCED
### Use 5 clusters in K-Means
################################################

# Open data
five_ccities <- st_read(dsn = 'assignment', layer = '5CCITIES')

# Convert to DataFrame
five_ccities_df <- #...

# K-Means cluster
num_clusters <- 5
five_ccities_ke <- #...

# Bind clusters, and name column
five_ccities_clusters <- cbind(...)
colnames(five_ccities_clusters)[ncol(five_ccities)] <- "cluster"

# Convert to spatial dataframe for leaflet compatibility and assign CRS
five_cities_sf <- st_as_sf(five_ccities_clusters)
st_crs(five_cities_sf) <- st_crs(4326)

# Create colour palette for leaflet
cols = brewer.pal(num_clusters, "Set2")
pal <- colorFactor(cols, five_cities_sf$cluster)

# Show Leaflet map
leaflet(five_cities_sf) %>%
  addProviderTiles("CartoDB.Positron") %>%           # Add basemap
  addCircleMarkers(col = pal(five_cities_sf$cluster), # Show different clusters
                   fillOpacity = 1, radius = 3)       # as different colours

# Display how many points in each cluster
five_cities_sf %>%
  group_by(cluster) %>%
  summarize(num_points = n())

################################################
### Question 2: Which two cities in the layer 5CCITIES
### stand out the most based upon housing status?
### VACANT, OWNER_OCC, RENTER_OCC
### Normalize columns by HSE_UNITS
### Use 5 clusters in K-Means
################################################

# Open data
five_ccities <- st_read(dsn = 'assignment', layer = '5CCITIES')

# Convert to DataFrame

# Normalize the selected variables
norm_five_ccities_df <- five_ccities_df[, c('VACANT', 'OWNER_OCC', 'RENTER_OCC')] *
  1 / five_ccities_df[, c('HSE_UNITS')]

# K-Means cluster

# Bind clusters, and name column

# Convert to spatial dataframe for leaflet compatibility

# Create colour palette for leaflet

# Show Leaflet map

# Display how many points in each cluster

################################################
### 3 Pts
### Question 3: The emap and dipin layers both contain
### information on water quality in lakes; using
### quadrat analysis, determine how these layers
### compare in terms of point distribution
################################################

# Open data
dipin <- st_read(dsn = 'assignment', layer = 'dipin')
emap <- st_read(dsn = 'assignment', layer = 'emap')

# You will notice the shapefiles do not have a CRS defined.
# Assign CRS 4326, which will enable us to reproject to UTM.
# See the code in Question 1 above
st_crs(dipin) <- st_crs(emap) <-

# Transform to NAD83 / UTM Zone 19N, EPSG:26919
dipin_proj <- st_transform(dipin, 26919)
emap_proj <- st_transform(emap, 26919)

# Create spatial point patterns

# Create window of North East States
# Read data
ne_usa <- st_read(dsn = 'assignment', layer = 'ne-usa')
# Assign the WGS 84 projection
st_crs(ne_usa) <- st_crs(4326)
# Project to UTM
ne_usa_proj <- st_transform(...)
outline <- st_union(st_combine(ne_usa_proj))
window <- as.owin(outline)

# Set window on ppp data

# Plot points
par(plt = c(0, 1, 0, 1), bg = 'white', col = 'red')
plot(dipin_points_window, pch = 20, cex = 0.5, main = 'q-plot',
     legend = F, label_value = F, which.marks = 1)
par(col = 'blue')
plot(emap_points_window, pch = 20, cex = 0.5, main = 'q-plot',
     legend = F, label_value = F, which.marks = 1, add = T)

# Count quadrats

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(dipin_qc, main = 'q-plot', col = alpha('black', 1), add = T)

# Chi-squared test

# Print results to console

# Optionally plot quadrats (count, expected count, deviation)
plot(dipin_chi2, cex = 0.5, col = 'black')

#### Both dipin and emap have the same degrees of freedom (and hence the same
#### critical value); both are clustered, with significant p-values. However,
#### we can say that dipin is more clustered than emap, an interesting result
#### given the obvious cluster in emap in the northwest region