In today’s lab, we will be looking at two fundamental exploratory analysis tools: cluster analysis (using K-Means) and quadrat analysis.
K-Means Clustering
Clustering algorithms are useful for separating data into groups with similar qualities. Clustering is often not the end goal in and of itself, but it provides a means to start looking for patterns: once groups have been formed, it is possible to take a deeper look at what is driving those clusters, allowing us to find patterns in the attributes.
Below is a demo of the K-Means algorithm created using Shiny, much like the maps we made in Lab 5. This demo allows you to step through the iterations to see the impact of each one.
https://connect.gis.unbc.ca/content/8453154e-321e-48eb-9304-7375277ae188
In this example, we are working in 2D space (that is, with two attributes), and distance is calculated as the distance between each point p and each cluster centre k. This is relatively easy to understand and provides a helpful visual.
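For reference, here is a minimal sketch of the assignment step the demo is animating. The points and centres are made-up toy values, and kmeans() handles all of this internally; this is just to make the distance calculation concrete.

# Toy data: three points and two cluster centres in 2D (two attributes)
pts <- matrix(c(1, 2, 5, 6, 1.5, 1.8), ncol = 2, byrow = TRUE)
centres <- matrix(c(1, 2, 5, 5), ncol = 2, byrow = TRUE)

# Euclidean distance from every point p to every centre k
dists <- apply(centres, 1, function(k) sqrt(rowSums(sweep(pts, 2, k)^2)))

# Each point joins its nearest centre; K-Means then recomputes each centre as
# the mean of its members and repeats both steps until nothing changes
assignment <- apply(dists, 1, which.min)
assignment  # 1 2 1: points 1 and 3 join centre 1, point 2 joins centre 2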
First, let’s look at the Guerry dataset documentation file to make sure we understand the data.
library(tidyverse)
library(sf)
library(rgdal)
library(leaflet)

guerry_shp <- st_read("guerry/guerry.shp")

ggplot() +
  geom_sf(data = guerry_shp, aes(geometry = geometry), color = 'blue')

# To perform K-Means analysis we need to convert the data from the simple
# feature format to a dataframe
head(guerry_shp)
guerry_df <- as.data.frame(guerry_shp)
head(guerry_df)

# Compute K-Means
num_clusters <- 5
guerry_ke <- kmeans(guerry_df[, c('Crm_prs', 'Crm_prp', 'Litercy',
                                  'Donatns', 'Infants', 'Suicids')],
                    num_clusters)

# Bind cluster to input data !!! Do Not Sort Clusters !!!
guerry_clusters <- cbind(guerry_shp, guerry_ke$cluster)

# Convert to sf for leaflet
guerry_sf <- st_as_sf(guerry_clusters)

# Rename cluster column
colnames(guerry_sf)[ncol(guerry_shp)] <- "cluster"

# Make colour palette
cols = rainbow(num_clusters)

# Show Leaflet map
leaflet(guerry_sf %>% st_transform(4326)) %>%         # Leaflet does not understand UTM!
  addProviderTiles("CartoDB.Positron") %>%            # Add basemap
  addPolygons(color = cols[guerry_sf$cluster],        # Index the palette by cluster so
              fillOpacity = 1)                        # each cluster gets one colour
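Before mapping, it is worth poking around inside the fitted kmeans object to see what separates the clusters. A short sketch, using the guerry_ke object from above:

# Cluster sizes: how many departments fell into each cluster
guerry_ke$size

# Cluster centres: the mean of each attribute within each cluster; comparing
# rows shows which attributes distinguish the clusters
round(guerry_ke$centers, 1)

# Within-cluster sum of squares: lower values mean tighter clusters
guerry_ke$withinss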
One more tip: as the clusters start in random locations, the output may vary slightly based on the starting state. If you need consistent output for testing purposes, you can seed the random number generator before calling kmeans(), e.g. set.seed(123).
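For example, seeding before each run makes the clustering reproducible. A quick sketch using a few of the Guerry columns from above:

# Same seed before each call gives identical cluster assignments
set.seed(123)
run_1 <- kmeans(guerry_df[, c('Crm_prs', 'Crm_prp', 'Litercy')], num_clusters)

set.seed(123)
run_2 <- kmeans(guerry_df[, c('Crm_prs', 'Crm_prp', 'Litercy')], num_clusters)

all(run_1$cluster == run_2$cluster)  # TRUE: the runs match exactly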
Quadrat Analysis
Quadrat analysis allows for exploratory searching for patterns in the locations of points. In today’s lab we will be looking at the locations of crashes in the Lower Mainland, using the spatstat package. *Note that spatstat does not natively work with sf or ggplot, so our code will look a little different.
As always, we start by loading some data. This is a three-step process: first we open the data, then convert the coordinates into point locations, and finally convert the sf data into a spatial point pattern (ppp).
library(spatstat) # provides ppp objects, quadratcount(), and quadrat.test()

crash_data <- read.csv2("Lower_mainland_Crashes_expanded.csv") %>%
  select(Latitude, Longitude, Location)
# Note the use of read.csv2 vs the read_csv function we have used previously

# Convert the table to sf objects; note the data are in the WGS 84 CRS (map projection)
crash_locations <- st_as_sf(crash_data,
                            coords = c("Longitude", "Latitude"),
                            crs = 4326) %>%
  st_transform(32610) # Reproject to WGS 84 / UTM Zone 10N (EPSG:32610).
                      # Never calculate statistics on unprojected data!

# spatstat requires the data as a ppp (planar point pattern)
crash_points <- as.ppp(crash_locations)

# Minimize plot margins to help see points; set default point colour to red.
# Strange bug where col in plot() changes the background not the point colour,
# so it was moved to the par() function
par(plt = c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5))

# Plot points
plot(crash_points, pch = 20, cex = 0.5, main = 'q-plot',
     legend = F, label_value = F)

# Count quadrats
Q <- quadratcount(crash_points, nx = 5, ny = 5)

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(Q, main = 'q-plot', col = alpha('black', 1), add = T)
Once all the points have been counted, the next step is the chi-squared test, done with the quadrat.test() function.
# Chi-squared test
qtest <- quadrat.test(Q)

# Print results to console
qtest

# Optionally plot quadrats (count, expected count, deviation)
plot(qtest, cex = 0.7, col = 'black')
Let’s talk about the output.
# Chi-squared test of CSR using quadrat counts
# data:
# X2 = 140895, df = 24, p-value < 2.2e-16
# alternative hypothesis: two.sided
# Quadrats: 5 by 5 grid of tiles
The interpretation: the chi-squared statistic (X2 = 140895) is far larger than the critical value of a chi-squared distribution with 24 degrees of freedom (about 36.4 at the 0.05 level), meaning the counts vary between quadrats much more than complete spatial randomness would allow; in other words, these data are clustered. As the p-value is less than 0.05, we can take this result to be statistically significant.
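If you want to check these numbers yourself, the critical value comes from base R’s qchisq(), and the test object stores its components. A quick sketch:

# Critical value at alpha = 0.05 for 24 degrees of freedom
qchisq(0.95, df = 24)  # ~36.42, dwarfed by X2 = 140895

# quadrat.test() returns a standard htest object, so its pieces can be
# pulled out directly
qtest$statistic  # X2
qtest$parameter  # degrees of freedom
qtest$p.value    # p-value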
Data Prep is Important
You may have noticed the data we loaded was called Lower_mainland_Crashes_expanded. This expanded dataset contains one extra point for each extra crash at a location. In the original data, by contrast, those extra crashes are recorded in a column as a count of crashes at that location. If you repeat the analysis above with the original data, how do your results differ?
crash_data <- read_csv("Lower_mainland_Crashes_data.csv")
# How does the chi-squared value for this dataset differ from the expanded
# one? What does this tell us?
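A sketch of the remaining steps, which mirror the expanded-data code above (this assumes the original file uses the same Latitude/Longitude columns; check your CSV):

# Convert to sf points and reproject, exactly as before
crash_locations <- st_as_sf(crash_data, coords = c("Longitude", "Latitude"),
                            crs = 4326) %>%
  st_transform(32610)
crash_points <- as.ppp(crash_locations)

# Recount and retest on the same 5 x 5 grid
Q_orig <- quadratcount(crash_points, nx = 5, ny = 5)
quadrat.test(Q_orig)  # compare this X2 with the expanded dataset's 140895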
Looking Through the Window
Another interesting point about these data is that some of the quadrats contain a lot of ocean, and cars generally do not crash in the ocean. We can try to account for this by applying a window to the quadrat analysis.
# Open the lower_mainland shapefile to use as a window
outline <- st_read(dsn = "lower_mainland", layer = 'lower_mainland')

crash_data <- read_csv("Lower_mainland_Crashes_data.csv") %>%
  select(Latitude, Longitude, Location)

# Convert the spreadsheet to sf objects; note the data are in the WGS 84 CRS
crash_locations <- st_as_sf(crash_data,
                            coords = c("Longitude", "Latitude"),
                            crs = 4326) %>%
  st_transform(32610) # Reproject to UTM Zone 10N

# spatstat requires the data as a ppp
crash_points <- as.ppp(crash_locations)

# Set window on crash points
window <- as.owin(outline)
crash_points_window <- crash_points[window]

# Count quadrats
Q_window <- quadratcount(crash_points_window, nx = 5, ny = 5)

# Minimize plot margins to help see points; set default point colour to red
par(plt = c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5))

# Plot points
plot(crash_points_window, pch = 20, cex = 0.5, main = 'q_window-plot',
     legend = F, label_value = F)

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(Q_window, main = 'q_window-plot', col = alpha('black', 1), add = T)

# Chi-squared test
q_window_test <- quadrat.test(Q_window)
q_window_test
plot(q_window_test, cex = 0.7, col = 'black')
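With both tests in hand, you can put the statistics side by side. A sketch, assuming qtest from the earlier unwindowed run is still in your environment (note it was computed on the expanded file, so rerun it on the same input data if you want a strict comparison):

# Compare the chi-squared statistics of the unwindowed and windowed tests
c(unwindowed = unname(qtest$statistic),
  windowed   = unname(q_window_test$statistic))

# The window also changes the test itself: tiles clipped by the coastline
# have smaller areas (and ocean-only tiles are dropped entirely), so the
# expected counts and the degrees of freedom both shift
q_window_test$parameter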
More on Quadrats (Windows) Using Synthetic Data
synth_data_points <- st_read(dsn = "synthetic_dataset", layer = 'random_points')

# spatstat requires the data as a ppp, but you will notice the points are
# already in UTM Zone 10, so there is no need for a coordinate transformation
synth_data_points <- as.ppp(synth_data_points)

# Define the number of quadrats (grid size for the x and y axes)
wx <- 8
wy <- 8
Q_synth_window <- quadratcount(synth_data_points, nx = wx, ny = wy)

# Minimize plot margins to help see points; set default point colour to red
par(plt = c(0, 1, 0, 1), bg = 'white', col = alpha('red', 0.5))

# Plot points
plot(synth_data_points, pch = 20, cex = 0.5, legend = F, label_value = F)

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(Q_synth_window, col = alpha('black', 1), add = T)

q_synth_test <- quadrat.test(Q_synth_window)
q_synth_test # See the results of your q_synth_test

## LET'S EXPLORE
# The expected frequency is the total number of cases (counts) divided by the
# number of quadrats
ef <- synth_data_points$n / (wx * wy)

# We need to weigh the quadrat frequencies using the square of the deviations
# divided by the expected frequency. If the sum is small there is no need to
# reject the null hypothesis; if it is big, we reject the null hypothesis.
qw <- as.data.frame(Q_synth_window)
dS <- sum(((qw$Freq - ef)^2) / ef)
dS

# Repeat the above with different grid sizes (8 by 8, 7 by 7, ..., 2 by 2)
# and see the results
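Rather than editing wx and wy by hand for every grid size, the last step can be wrapped in a loop. A small sketch:

# Compute the chi-squared statistic for square grids from 8x8 down to 2x2
for (k in 8:2) {
  Q_k  <- quadratcount(synth_data_points, nx = k, ny = k)
  ef_k <- synth_data_points$n / (k * k)   # expected frequency per quadrat
  qw_k <- as.data.frame(Q_k)
  dS_k <- sum((qw_k$Freq - ef_k)^2 / ef_k)
  cat(sprintf("%d x %d grid: dS = %.2f\n", k, k, dS_k))
}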
Assignment
Answer the three questions below by adding the missing lines to the code.
Submit your completed R script and screenshots of the plots produced (or right-click on a graph and choose “Save as” to download it to your computer). Make sure to caption the plots in the PDF document.
(There is more on colour palettes in the RColorBrewer documentation.)
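If you want to browse the palettes before committing to one, RColorBrewer can draw them all. For example:

library(RColorBrewer)

# Preview every available palette in the plot pane; qualitative palettes
# (such as the "Set2" used below) suit unordered categories like cluster IDs
display.brewer.all()

# Pull five colours from one palette
brewer.pal(5, "Set2")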
library(tidyverse)
library(sf)
library(sp)
library(leaflet)
library(spatstat)
library(dplyr)
library(RColorBrewer)

################################################
### Question 1: Which city in the layer 5CCITIES
### stands out distinctly based upon the K-Means analysis of marital status?
### NEVERMARRY, MARRIED, SEPARATED, WIDOWED, DIVORCED
### Use 5 clusters in K-Means
################################################

# Open data
five_ccities <- st_read(dsn = 'assignment', layer = '5CCITIES')

# Convert to DataFrame
five_ccities_df <- #...

# K-Means cluster
num_clusters <- 5
five_ccities_ke <- #...

# Bind clusters, and name column
five_ccities_clusters <- cbind(...)
colnames(five_ccities_clusters)[ncol(five_ccities)] <- "cluster"

# Convert to spatial dataframe for leaflet compatibility and assign CRS
five_cities_sf <- st_as_sf(five_ccities_clusters)
st_crs(five_cities_sf) <- st_crs(4326)

# Create colour palette for leaflet
cols = brewer.pal(num_clusters, "Set2")
pal <- colorFactor(cols, five_cities_sf$cluster)

# Show Leaflet map
leaflet(five_cities_sf) %>%
  addProviderTiles("CartoDB.Positron") %>%           # Add basemap
  addCircleMarkers(col = pal(five_cities_sf$cluster), # Show different clusters
                   fillOpacity = 1, radius = 3)       # as different colours

# Display how many points in each cluster
five_cities_sf %>%
  group_by(cluster) %>%
  summarize(num_points = n())

################################################
### Question 2: Which two cities in the layer 5CCITIES
### stand out the most based upon housing status?
### VACANT, OWNER_OCC, RENTER_OCC
### Normalize columns by HSE_UNITS
### Use 5 clusters in K-Means
################################################

# Open data
five_ccities <- st_read(dsn = 'assignment', layer = '5CCITIES')

# Convert to DataFrame

# Normalize the selected variables
norm_five_ccities_df <- five_ccities_df[, c('VACANT', 'OWNER_OCC', 'RENTER_OCC')] *
  1 / five_ccities_df[, c('HSE_UNITS')]

# K-Means cluster

# Bind clusters, and name column

# Convert to spatial dataframe for leaflet compatibility

# Create colour palette for leaflet

# Show Leaflet map

# Display how many points in each cluster

################################################
### 3 Pts
### Question 3: The emap and dipin layers both contain
### information on water quality in lakes; using
### quadrat analysis, determine how these layers
### compare in terms of point distribution
################################################

# Open data
dipin <- st_read(dsn = 'assignment', layer = 'dipin')
emap <- st_read(dsn = 'assignment', layer = 'emap')

# You will notice the shapefiles do not have a CRS defined.
# Assign CRS 4326, which will enable us to reproject to UTM.
# See the code in Question 1 above
st_crs(dipin) <- st_crs(emap) <-

# Transform to NAD83 / UTM Zone 19N, EPSG:26919
dipin_proj <- st_transform(dipin, 26919)
emap_proj <- st_transform(emap, 26919)

# Create spatial point patterns

# Create window of North East States
# Read data
ne_usa <- st_read(dsn = 'assignment', layer = 'ne-usa')
# Assign the WGS 84 projection
st_crs(ne_usa) <- st_crs(4326)
# Project to UTM
ne_usa_proj <- st_transform(...)
outline <- st_union(st_combine(ne_usa_proj))
window <- as.owin(outline)

# Set window on ppp data

# Plot points
par(plt = c(0, 1, 0, 1), bg = 'white', col = 'red')
plot(dipin_points_window, pch = 20, cex = 0.5, main = 'q-plot',
     legend = F, label_value = F, which.marks = 1)
par(col = 'blue')
plot(emap_points_window, pch = 20, cex = 0.5, main = 'q-plot',
     legend = F, label_value = F, which.marks = 1, add = T)

# Count quadrats

# Fade points to make counts readable
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
     col = alpha('white', 0.7))

# Add quadrat grid
plot(dipin_qc, main = 'q-plot', col = alpha('black', 1), add = T)

# Chi-squared test

# Print results to console

# Optionally plot quadrats (count, expected count, deviation)
plot(dipin_chi2, cex = 0.5, col = 'black')

#### Both dipin and emap have the same degrees of freedom (and hence the same
#### critical value); both are clustered, with significant p-values. However,
#### we can say that dipin is more clustered than emap, an interesting result
#### given the obvious cluster in emap in the northwest region