In today’s lab, we will be looking at some of the more fundamental exploratory analysis tools. Cluster Analysis (using K-Means), and Quadrat Analysis,
K-Means Clustering
Clustering algorithms are useful for separating data into groups with similar qualities, clustering is often not the end goal in and of itself, but it does provide a means to start looking for patterns, once groups have been formed it is then possible to take a deeper look into what is driving those clusters allowing us to find patterns in the attributes.
Below is a demo of the K-Means algorithm created using shiny, much like we made our maps in lab 5. This demo allows you to step through the iterations to see the impacts.
https://connect.gis.unbc.ca/content/8453154e-321e-48eb-9304-7375277ae188/
In this example, we are working in 2D space that is to say 2 attributes, and distance is calculated as the distance between point p, and cluster k is.
\[d\ =\ \sqrt{\left(p.x-k.x\right)^2\ +\left(p.y\ -\ k.y\right)^2}\]
This is relatively easy to understand, when moving into n columns this formula will change to
\[\mathrm{d=}\sqrt{\sum_{\mathrm{i=1}}^{\mathrm{n}}\mathrm{p.}\mathrm{i}^{\mathrm{2}}\mathrm{-k.}\mathrm{i}^{\mathrm{2}}}\]
As a refresher, we will do the same exercise we did in class last week.
Open RStudio Workbench (https://rstudio.gis.unbc.ca) and start by importing your libraries
library(tidyverse) library(maptools) library(sf) library(rgdal) library(leaflet)
Next, open the data from the L drive and plot it
guerry_shp <- readOGR(dsn = "~/L/GEOG413/lab07/guerry", layer = 'guerry') plot(guerry_shp)
The K-Means is simple to calculate in R, however it must first be converted into a dataframe
# Convert to data.frame guerry_df <- as.data.frame(guerry_shp) # Compute K-Means num_clusters <- 5 guerry_ke <- kmeans(guerry_df[, c('Crm_prs', 'Crm_prp', 'Litercy', 'Donatns', 'Infants', 'Suicids')], num_clusters)
Currently, we have a list of clusters, but to see what region belongs to each cluster we must bind the computed clusters to our original data.
# Bind cluster to input data !!! Do Not Sort Clusters !!! guerry_clusters <- cbind(guerry_shp, guerry_ke$cluster) # Convert to sf for leaflet guerry_sf <- st_as_sf(guerry_clusters) # Rename cluster column (index is 1 larger than number of input columns) colnames(guerry_sf)[ncol(guerry_shp)+1] <- "cluster"
Finally look at the results on the map
# Make colour palate cols = rainbow(num_clusters) # Show Leaflet map leaflet(guerry_sf %>% st_transform(4326)) %>% # Leaflet does not understand UTM! addProviderTiles("CartoDB.Positron") %>% # Add basemap addPolygons(color = cols, fillOpacity = 1) # Show different clusters as different colours
One more tip, as the clusters start in random locations, the output may vary slightly based on the starting state, if you need consistent output for testing purposes you can ‘seed’ the random number generator, before calling kmeans()
set.seed(42)
Quadrat Analysis
Quadrat Analysis allows for exploratory searching of patterns in locations of points. In today’s lab we will be looking at the location of crashes on the lower mainland, using the spatStat package. *Note that spatStat does not natively work with sf or ggplot, so our code will look a little different than previous labs.
Like always we start by loading some data, this is a three-step process first we open the data, then convert the coordinates into locations, finally converting the sf data into spatial point patterns (ppp).
# Start of Lab crash_data <- read.csv2("~/L/GEOG413/lab07/Lower_mainland_Crashes_expanded.csv") %>% select(Latitude, Longitude, Location) # Convert spreadsheet to sf objects, note data is in WGS84 CRS crash_locations <- st_as_sf(crash_data, coords = c("Longitude", "Latitude"), crs = 4326) %>% st_transform(32610) # SpatStat requires data as ppp, transform to UTM 10N. Do not # Calculate statistics on unprojected data! crash_points <- as.ppp(crash_locations %>% st_transform(32610))
Examine the data by plotting
# Minimize plot margins to help see points, set default point to red # Strange bug where col in plot() changes background not point colour so moved to par() par(plt=c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5)) # Plot points plot(crash_points, pch=20, cex = 0.5, main='q-plot', legend = F, label_value = F)
Next, to perform quadrat analysis the data needs to be split into quadrats, and the points inside counted, note that the size of your quadrats will affect the answer, we will talk more about this; for now to make the plots easy to see we will work with a 5×5 grid of quadrats.
# Count quadrats Q <- quadratcount(crash_points, nx= 5, ny=5) # Fade points to make counts readable rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col=alpha('white', 0.7)) # Add quadrat grid plot(Q, main='q-plot', col = alpha('black', 1), add=T)
Once all the points have been counted the next step is the chi-squared test, done with the quadrat.test() function.
# Chi-squared test qtest <- quadrat.test(Q) # Print results to console qtest # Optionally plot quadrats (count, expected count, deviation) plot(qtest, cex = .7, col = 'Black')
Lets talk about the output
Chi-squared test of CSR using quadrat counts data: X2 = 140895, df = 24, p-value < 2.2e-16 alternative hypothesis: two.sided Quadrats: 5 by 5 grid of tiles
The interpretation of this is that Chi-squared (X2) is larger than the critical value (df), suggesting that this data is clustered; as the p-value is less than 0.05 we can take these results to be statistically significant.
Data Prep matters
You may have noticed the data we loaded was called Lower_mainland_Crashes_expanded; what does this mean? On the original data each point contained a count of crashes at that location, the expanded dataset contains extra points for each extra crash. If you try the above again changing to the original data how do your results differ?
crash_data <- read.csv2("~/L/GEOG413/lab07/Lower_mainland_Crashes_data.csv", sep = ',')
The original data has a smaller Chi-squared, what does that tell us about the data?
Looking Through the Window
Another interesting point in this data is that there is a lot of ocean, and cars generally do not crash in the ocean. We can account for this by applying a window to the quadrat analysis.
# Open lowermainland shp as window outline <- readOGR(dsn = "~/L/GEOG413/lab07/lower_mainland", layer = 'lower_mainland') crash_data <- read.csv2("~/L/GEOG413/lab07/Lower_mainland_Crashes_expanded.csv") %>% select(Latitude, Longitude, Location) # Convert spreadsheet to sf objects, note data is in WGS84 CRS crash_locations <- st_as_sf(crash_data, coords = c("Longitude", "Latitude"), crs = 4326) %>% st_transform(32610) # SpatStat requires data as ppp, transform to UTM 10N. crash_points <- as.ppp(crash_locations %>% st_transform(32610)) # Set window on crash points window <- as.owin(outline) crash_points_window <- crash_points[window] # Count quadrats Q_window <- quadratcount(crash_points_window, nx=5, ny=5) # Minimize plot margins to help see points, set default point to red par(plt=c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5)) # Plot points plot(crash_points_window, pch=20, cex = 0.5, main='q_window-plot', legend = F, label_value = F) # Fade points to make counts readable rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col=alpha('white', 0.7)) # Add quadrat grid plot(Q_window, main='q_window-plot', col = alpha('black', 1), add=T) # Chi-squared test q_window_test <- quadrat.test(Q_window) q_window_test plot(q_window_test, cex = 0.7, col = 'black')
In this case, you can see that the data is appearing less clustered and that the expected counts are not the same for all quadrats as they are scaled based on area.
Evenly Spaced Data
regular_points <- readOGR(dsn = "~/L/GEOG413/lab07/synthetic_data", layer = 'regular_points') regular_locations <- st_as_sf(regular_points) # SpatStat requires data as ppp, transform to UTM 10N. crash_points_regular <- as.ppp(regular_locations %>% st_transform(32610)) # Open lowermainland shp as window outline <- readOGR(dsn = "~/L/GEOG413/lab07/lower_mainland", layer = 'lower_mainland') plot(outline) window <- as.owin(outline) # Set window on crash points crash_points_regular_window <- crash_points_regular[window] # Count quadrats Q_regular_window <- quadratcount(crash_points_regular_window, nx= 5, ny=5) # Minimize plot margins to help see points, set default point to red par(plt=c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5)) # Plot points plot(crash_points_regular_window, pch=20, cex = 0.5, main='q_window_regular-plot', legend = F, label_value = F) # Fade points to make counts readable rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col=alpha('white', 0.7)) # Add quadrat grid plot(Q_regular_window, main='q_window_regular-plot', col = alpha('black', 1), add=T) q_regular_window_test <- quadrat.test(Q_regular_window) q_regular_window_test plot(q_regular_window_test, cex = 0.5, col = alpha('black', 1))
Chi-squared test of CSR using quadrat counts data: X2 = 7.8595, df = 21, p-value = 0.008577 alternative hypothesis: two.sided Quadrats: 22 tiles (irregular windows)
Randomly Distributed Data
random_points <- readOGR(dsn = "~/L/GEOG413/lab07/synthetic_data", layer = 'random_points') random_locations <- st_as_sf(random_points) # SpatStat requires data as ppp, transform to UTM 10N. crash_points_random <- as.ppp(random_locations %>% st_transform(32610)) # Open lowermainland shp as window outline <- readOGR(dsn = "~/L/GEOG413/lab07/lower_mainland", layer = 'lower_mainland') plot(outline) window <- as.owin(outline) # Set window on crash points crash_points_random_window <- crash_points_random[window] # Count quadrats Q_random_window <- quadratcount(crash_points_random_window, nx= 5, ny=5) # Minimize plot margins to help see points, set default point to red par(plt=c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5)) # Plot points plot(crash_points_random_window, pch=20, cex = 0.5, main='q_window_random-plot', legend = F, label_value = F) # Fade points to make counts readable rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col=alpha('white', 0.7)) # Add quadrat grid plot(Q_random_window, main='q_window_random-plot', col = alpha('black', 1), add=T) q_random_window_test <- quadrat.test(Q_random_window) q_random_window_test plot(q_random_window_test, cex = 0.7, col = alpha('black', 1))
Chi-squared test of CSR using quadrat counts data: X2 = 23.694, df = 21, p-value = 0.6163 alternative hypothesis: two.sided Quadrats: 22 tiles (irregular windows)
Assignment
Answer the 3 questions below adding the missing lines to the code.
Submit your completed R script and screenshots of plots produced (or right click on graphs and save as to download to your computer). Make sure to caption the plots in a word document as the labels have been hidden to give more space.
library(tidyverse) library(maptools) library(sf) library(sp) library(rgdal) library(leaflet) library(spatstat) library(dplyr) library(rgeos) ################################################ ### Question 1: Which city in the layer 5CCITIES ### is the most unique based upon Marital Status ### NEVERMARRY, MARRIED, SEPARATED, WIDOWED, DIVORCED ### Use 5 clusters in K-Means ################################################ # Open data five_ccities <- readOGR(dsn = '~/L/GEOG413/lab07/assignment', layer = '5CCITIES') # Convert to DataFrame # K-Means Cluster # Bind clusters, and name column colnames(city_clusters@data)[length(five_ccities@data) + 1] <- 'clusters' # Convert to spatial dataframe for leaflet compatability city_sf <- st_as_sf(city_clusters) # Create colour pallet for leaflet pal <- colorFactor(rainbow(nrow(city_sf)), domain = city_sf$clusters) # Show Leaflet map # Display how many points in each cluster city_sf %>% group_by(clusters) %>% summarize(num_points = n()) ################################################ ### Question 2: Which two cities in the layer 5CCITIES ### is the most unique based upon housing status ### VACANT, OWNER_OCC, RENTER_OCC ### Normalize columns by HSE_UNITS ### Use 5 clusters in K-Means ################################################ # Open data five_ccities <- readOGR(dsn = '~/L/GEOG413/lab07/assignment', layer = '5CCITIES') # Convert to DataFrame # K-Means Cluster # Bind clusters, and name column city_clusters <- cbind(five_ccities, klusters$cluster) colnames(city_clusters@data)[length(five_ccities@data) + 1] <- 'clusters' # Convert to spatial dataframe for leaflet compatability city_sf <- st_as_sf(city_clusters) # Create colour pallet for leaflet pal <- colorFactor(rainbow(nrow(city_sf)), domain = city_sf$clusters) # Show Leaflet map # Display how many points in each cluster city_sf %>% group_by(clusters) %>% summarize(num_points = n()) ################################################ ### 3 Pts ### Question 3: emap, and dipin layers both contain ### information on water quality in lakes; using ### quadrat analysis determine how these layers ### compare in terms of point distribution ################################################ # Open Data dipin <- readOGR(dsn = '~/L/GEOG413/lab07/assignment', layer = 'dipin') emap <- readOGR(dsn = '~/L/GEOG413/lab07/assignment', layer = 'emap') # Shapfiles do not have CRS defined, define to allow transformation proj4string(dipin) <- proj4string(emap) <- # Transform to psuedo-webmercator projection dipin_proj <- emap_proj <- # Create spatial point patterns # Create window of North East States ne_usa <- readOGR(dsn = '~/L/GEOG413/lab07/assignment', layer = 'ne-usa') proj4string(ne_usa) <- CRS("+init=epsg:4326") ne_usa_proj <- spTransform(ne_usa, CRS("+init=epsg:3857")) outline <- gUnaryUnion(ne_usa_proj) window <- as.owin(outline) # Set window on ppp data # Plot points par(plt=c(0, 1, 0, 1), bg = 'white', col = 'red') plot(dipin_points_window, pch=20, cex = 0.5, main='q-plot', legend = F, label_value = F, which.marks=1) par(col='blue') plot(emap_points_window, pch=20, cex = 0.5, main='q-plot', legend = F, label_value = F, which.marks=1, add=T) # Count quadrats # Fade points to make counts readable rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col=alpha('white', 0.7)) # Add quadrat grid plot(dipin_qc, main='q-plot', col = alpha('black', 1), add=T) # Chi-squared test # Print results to console # Optionally plot quadrats (count, expected count, deviation) plot(dipin_chi2, cex = .5, col = 'Black') #### Both dipin and emap have same critical value, both are clustered, and have #### good p-values, however we can say that dipin is more clustered than e-map #### an interesting result given the obvious cluster in emap in North West region