For today’s lab we will be looking at image segmentation using scikit-image. There are other software platforms we could use, but scikit-image offers a large selection of algorithms and provides a good framework for testing hyperparameters (the variables used to control machine learning processes).
We will be working primarily in Jupyter Notebooks, and will use ArcGIS Pro for some data conversions.
Go to https://jupyter.gis.unbc.ca. JupyterLab has been upgraded since we last used it in this course. After logging in you will be able to start your server; when asked to select an image, choose “Datascience”. The Datascience environment takes slightly longer to start, but contains extra libraries for machine learning, of most interest today being scikit-image.
Create a new notebook, and in the first code block add the following import statements.
```python
import rasterio as rio
import rasterio.plot
import matplotlib.pyplot as plt
import matplotlib.collections as plt_collection
from skimage import io, segmentation, filters, metrics
import numpy as np
from osgeo import ogr
from osgeo import gdal
import geopandas
```
Confirm this runs without any errors before proceeding.
Next, define a function to plot image segments. It is not important at this stage to understand exactly what this function does, just that it overlays segments on top of the image they are derived from.
```python
def seg_show(segments, image, raster):
    # Write the segments to a GeoTIFF with the same profile as the original image
    dst_layername = "temp_segments.tif"
    img_profile = raster.profile
    img_profile.update(dtype=raster.meta['dtype'], count=1)
    with rio.open(dst_layername, 'w', **img_profile) as dst:
        dst.write(segments.astype(raster.meta['dtype']), 1)

    # Polygonize the segment raster into a shapefile with GDAL
    src_ds = gdal.Open(dst_layername)
    srcband = src_ds.GetRasterBand(1)
    drv = ogr.GetDriverByName("ESRI Shapefile")
    dst_ds = drv.CreateDataSource(dst_layername + ".shp")
    dst_layer = dst_ds.CreateLayer(dst_layername, srs=None)
    gdal.Polygonize(srcband, None, dst_layer, -1, [], callback=None)
    dst_layer = None
    dst_ds = None

    # Plot the polygon edges over the original image
    fig, ax = plt.subplots(figsize=(15, 15))
    rasterio.plot.show(raster, ax=ax)
    edges = geopandas.read_file(dst_layername + ".shp")
    edges.plot(ax=ax, facecolor='none', edgecolor='red')
```
For those trying to follow along, the function has 3 inputs:
- segments: a set of polygons to graph.
- image: an RGB image, specifically with band stored in axis 2.
- raster: the original image, which contains the geocoding information. (This function could be rewritten to not take image, using only raster; however, raster would need to be reshaped, as its band is axis 0.)
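The band-axis difference between the two inputs can be sketched with a toy array. This is an illustrative example, not part of the lab code: rasterio reads rasters as [band, row, col], while the skimage functions used today expect [row, col, band], and numpy’s `moveaxis` performs the swap.

```python
import numpy as np

# Toy 3-band, 4x5 raster in rasterio's [band, row, col] order
bands_first = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Move the band axis to the end to get the [row, col, band]
# order expected by skimage's segmentation functions
bands_last = np.moveaxis(bands_first, 0, -1)

print(bands_first.shape)  # (3, 4, 5)
print(bands_last.shape)   # (4, 5, 3)
```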
This is far from an optimal solution; however, it is functional and will be good enough for today. To plot the segments over the image they need to be geocoded. This is done by writing them to a tif with the same profile as the original image, then using GDAL to make a shapefile from this tif; finally, GeoPandas is used to plot this shapefile over the original image. Performing the conversion Vector > Raster > Vector should not be taken as an example of proper coding, but rather The Art of the Bodge.
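As an aside, when geocoded output is not needed, skimage itself can draw segment boundaries directly with `segmentation.mark_boundaries`. A minimal sketch on synthetic data (the image and labels here are made up for illustration):

```python
import numpy as np
from skimage import segmentation

# Synthetic 20x20 RGB image and a two-segment label array
rng = np.random.default_rng(42)
image = rng.random((20, 20, 3))
labels = np.zeros((20, 20), dtype=int)
labels[:, 10:] = 1  # left half is segment 0, right half segment 1

# Returns a copy of the image with segment boundaries painted in
outlined = segmentation.mark_boundaries(image, labels, color=(1, 0, 0))
print(outlined.shape)  # (20, 20, 3)
```

This skips the geocoding entirely, so the result cannot be overlaid on other GIS layers, but it is handy for quick visual checks.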
Open and Plot an Image
In JupyterLab upload the tif file PGDowntownSmall.tif from L:\GEOG457\Lab7\pg, remembering to add all 4 parts of the file.
You should now be able to open the image. Line 2 opens a greyscale copy, raster is the original data, and rgb_image is reshaped so that the structure changes from [band, x, y] to [x, y, band] for compatibility with the functions used today.
```python
image_path = "PGDowntownSmall.tif"
grey_image = io.imread(image_path, as_gray=True)
raster = rio.open(image_path)
rgb_image = np.empty((raster.height, raster.width, raster.count-1), raster.meta['dtype'])
for band in range(rgb_image.shape[2]):
    rgb_image[:, :, band] = raster.read(band+1)
```
And display the image with plt
```python
plt.figure(figsize=(15, 12))
plt.imshow(rgb_image)
plt.show()
```
How Segmentation Works
Broadly speaking, there are 3 types of data used to segment an image:
- Geometry: Many algorithms make segments of approximate size and shape. This could be defined as how many segments to generate over the entire image, how many pixels are in each segment, or how smooth the edges should be.
- Watershed/Maxima: Not used in today’s lab; however, this technique was used in the LiDAR lab for tree detection. It is particularly well suited to elevation data, though also useful with uniform colours.
- Edges (Colour): Edges appear in images as high-contrast lines. They are detected as a rapid change of colour in the image, and have direction and intensity. Some older digital cameras focus by maximizing edge intensity.
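To make “rapid change of colour” concrete, here is a hand-rolled sketch of what a Roberts-style edge filter computes on a tiny greyscale step image. This is pure numpy for illustration only; in the lab we use the skimage implementations.

```python
import numpy as np

# A tiny 4x4 greyscale image with a vertical brightness step
img = np.array([
    [0., 0., 1., 1.],
    [0., 0., 1., 1.],
    [0., 0., 1., 1.],
    [0., 0., 1., 1.],
])

# Roberts cross kernels take differences between diagonal neighbours;
# combined they approximate the local gradient magnitude
gx = img[:-1, :-1] - img[1:, 1:]   # difference along one diagonal
gy = img[:-1, 1:] - img[1:, :-1]   # difference along the other diagonal
edge = np.sqrt(gx**2 + gy**2)      # gradient magnitude per pixel

print(edge)  # nonzero only in the middle column, where the step is
```

Flat regions produce zero response; the only strong values land on the column where the brightness changes, which is exactly the “high contrast line” described above.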
Look at the edges in the image
```python
edge_roberts = filters.roberts(grey_image)
plt.figure(figsize=(15, 12))
plt.imshow(edge_roberts)
plt.show()
```
Also note that there are multiple algorithms that may perform better or worse depending on the source data. Try the Sobel filter and compare the results.
```python
edge_sobel = filters.sobel(grey_image)
plt.figure(figsize=(15, 12))
plt.imshow(edge_sobel)
plt.show()
```
At this stage it should be clear that there is as much art as science in the segmentation process.
Generate Segments
One of the great features of scikit-image is that it provides many algorithms that all have a similar structure, making comparison easy. They will all be of the format segmentation.<algorithm>(image, hyperparameters…).
```python
slic_segments = segmentation.slic(rgb_image, convert2lab=False, start_label=1,
                                  compactness=0.5, n_segments=150)
seg_show(slic_segments, rgb_image, raster)
```
*Note the parameter convert2lab: by default, many of the segmentation algorithms convert RGB colour to Lab colour. However, a bug in one of the currently installed Python libraries causes this process to stall, so we want to suppress this behaviour. This likely has an impact on accuracy, but a proper fix is not practical in JupyterLab at this time.
Similarly, the image could also be segmented using Felzenszwalb’s Efficient Graph
```python
fGraph_segments = segmentation.felzenszwalb(rgb_image, scale=3000.0, sigma=0.99,
                                            min_size=150)
seg_show(fGraph_segments, rgb_image, raster)
```
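Because the algorithms share this calling convention, a hyperparameter sweep is just a loop. A minimal sketch on a small synthetic image (the image and parameter values are illustrative, not the lab data):

```python
import numpy as np
from skimage import segmentation

# Synthetic RGB image: two coloured halves with a little noise
rng = np.random.default_rng(0)
img = np.zeros((40, 40, 3))
img[:, 20:] = [0.9, 0.2, 0.1]
img += rng.normal(0, 0.02, size=img.shape)

# Same structure for each run: segmentation.<algorithm>(image, hyperparameters...)
for scale in (10.0, 100.0, 1000.0):
    labels = segmentation.felzenszwalb(img, scale=scale, sigma=0.8, min_size=10)
    print("scale =", scale, "->", len(np.unique(labels)), "segments")
```

In general, larger scale values favour fewer, larger segments; swapping in `segmentation.slic` with its own hyperparameters follows the same pattern.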
Segmentation Quality
The naive approach to segmentation is to visually assess the segments and adjust the hyperparameters; however, it is possible to quantify segmentation quality. Five common metrics for this are:
- False Merge
- False Split
- Adapted Random Error
- Adapted Random Precision
- Adapted Random Recall

Adapted Random Precision (ARP) is the probability that a pixel of a given class in the results is the same class in the truthing data, normalized over the total number of pixels in the classified data; in simple terms, it can be thought of as the percentage of pixels correctly segmented (Arganda-Carreras et al. 2015). Higher values are better for this base metric. ARP relates to false merges: if a segment is too large it will lower the ARP, in proportion to the area of over-segmentation rather than the quantity of segments.
Adapted Random Recall (ARR) is the probability that a pixel of a given class in the results is the same class in the truthing data, normalized over the number of pixels in the truthing data (Arganda-Carreras et al. 2015). Higher values are better for this base metric. As ARP relates to False Merges, ARR relates to False Splits, again measuring the number of pixels placed into alternate segments rather than the number of additional segments produced.
ARP and ARR are closely related statistics, so it may be useful to take an average of them. Adapted Random Error is defined by the equation ARE = 1 - 2(ARP*ARR)/(ARP+ARR). In this case we seek a lower value, with 0 being the perfect classification: a value of 0 means all pixels can be directly mapped from the classified data to the training mask.
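Plugging illustrative numbers into this formula (values made up for demonstration) shows how the averaging behaves:

```python
# Illustrative values, not results from the lab data
arp = 0.8   # Adapted Random Precision
arr = 0.9   # Adapted Random Recall

# ARE is one minus the harmonic mean of ARP and ARR,
# so ARP = ARR = 1 gives the perfect score of 0
are = 1 - 2 * (arp * arr) / (arp + arr)
print(round(are, 4))  # 0.1529
```

Because the harmonic mean is dominated by the smaller of the two inputs, a segmentation cannot hide a poor ARP behind a high ARR, or vice versa.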
Below are some graphs from Matt’s thesis showing plots of these metrics against a brute-force exploration of the SLIC hyperparameter space (about 30,000 tests).

False Merges compete with False Splits, and Adapted Random Precision with Adapted Random Recall, so a balance must be found. At the extreme, if only one segment is produced there will be no False Splits; conversely, if every pixel were its own segment there would be no False Merges.
Make a Segmentation Mask
To generate the above statistics we need a segmentation mask to compare the generated segments against. You can create this in ArcGIS Pro: create a new project and add the PGDowntownSmall.tif image.
- Right click on the home folder in the Catalog pane and create a new shapefile with a geometry type of polygon.
- Use the Edit tab to create new features, drawing polygons over the roofs (at least 10).
- Use the Feature to Raster tool to create a tif file, setting the value to FID (every segment should be a unique colour), and making sure the output raster has the .tif extension (the default format is ESRI GRID). Under Environments, the Output Coordinate System, Processing Extent, Cell Size, and Snap Raster should all be set to PGDowntownSmall.tif. The file generated must be exactly the same dimensions as the original image; even one pixel difference in resolution will cause the metrics to fail.
For a real project, the mask would likely contain hundreds of segments representative of all classes to be detected and encompassing the range of sizes and shapes present in the scene.

Open the truthing data:

```python
ground_truth = rio.open("pg_downtown_truth.tif")
ground_truth_np = ground_truth.read(1)
```
And compute the metrics
```python
# Further reading on error metrics:
# https://www.frontiersin.org/articles/10.3389/fnana.2015.00142/full
error, precision, recall = metrics.adapted_rand_error(ground_truth_np, slic_segments,
                                                      ignore_labels=[65535])
splits, merges = metrics.variation_of_information(ground_truth_np, slic_segments,
                                                  ignore_labels=[65535])
print("ARE: " + str(error) + " ARP: " + str(precision) + " ARR: " + str(recall)
      + " FS: " + str(splits) + " FM: " + str(merges)
      + " Segments: " + str(len(np.unique(slic_segments))))
```
```python
error, precision, recall = metrics.adapted_rand_error(ground_truth_np, fGraph_segments,
                                                      ignore_labels=[65535])
splits, merges = metrics.variation_of_information(ground_truth_np, fGraph_segments,
                                                  ignore_labels=[65535])
print("ARE: " + str(error) + " ARP: " + str(precision) + " ARR: " + str(recall)
      + " FS: " + str(splits) + " FM: " + str(merges)
      + " Segments: " + str(len(np.unique(fGraph_segments))))
```
You may notice the ignore_labels value of 65535; this is the default nodata value from ArcGIS. As not all objects have been masked, this prevents the background from being interpreted as a single monolithic segment.
Assignment
This lab is intentionally short to allow time to explore the hyperparameters; as such, this assignment may take more time than previous assignments, but the two should balance out overall.
- (1.5 points) Produce a research log exploring the SLIC hyperparameters. Starting with the lab example, modify one parameter at a time; in your Word document, paste a screenshot of the segments and a brief statement of whether the change was an improvement or not. Modify a hyperparameter that you think will improve the results, and repeat until your document has 8 different segmentations. Finish with a brief concluding statement on the impact of each hyperparameter.
- (1.5 points) Do the same with Felzenszwalb’s Efficient Graph, increasing to a minimum of 10 repetitions as there is an extra hyperparameter.
- (2 points) Implement the QuickShift algorithm (see the list of modules here: https://scikit-image.org/docs/stable/api/skimage.segmentation.html) and, as in questions 1 & 2, investigate the ratio and kernel_size parameters (restrict kernel_size options to {1, 3, 5}; 4 repetitions). If you have trouble getting this to run, e-mail Matt.