Principal Component Analysis
Eigenvectors, Eigenvalues, Identity Matrices
It is helpful to have some background in linear algebra. For our purposes, though, we’ll start with a remote sensing example.
A visual example
To start our exploration of PCA we will first look at the principal components of an image as it helps to intuitively show what the numbers mean, start by opening a Landsat image and plotting the image in R Studio.
# Load Libraries library(raster) library(dplyr) library(imager) # Open the image # On windows path is 'L:/GEOG413....' image = stretch(stack('prince_george_lc8.tif')) #In landsat 8 images, red is band 4, green 3, and blue 2. plotRGB(image, r = 4, g = 3, b = 2) #plot all 7 bands. plot(image)
At this point, you should be able to see an image of Prince George in 7 different wavelengths as captured by Landsat 8. Before the image can have the PCA calculated it must be converted to a data frame with only 2 dimensions (currently we have x, y, and band, we need xy, and band).
dim(image) image.df <- as.data.frame(image) dim(image.df) #If your dimensions are reduced you can now compute the Principal Components of the image using the dplyr library pca <- prcomp(image.df) #look at both a summary of the results, as well as the eigenvalues produced. summary(pca) pca
First, take a look at the output of the summary
Importance of components: PC1 PC2 PC3 PC4 PC5 PC6 PC7 Standard deviation 27.5248 21.8380 6.90914 2.79606 1.64263 1.05053 0.60443 Proportion of Variance 0.5854 0.3685 0.03688 0.00604 0.00208 0.00085 0.00028 Cumulative Proportion 0.5854 0.9539 0.99074 0.99678 0.99887 0.99972 1.00000
In this table we can see there are 7 principal components across the top (why are there 7?). They are shown along with the Standard Deviation of pixel intensities, the Proportion of Variance represented, and the Cumulative Proportion of variance. This is telling us, that Principal Component 1 represents 0.5854 of the variation in pixel intensities, and combined with Principal Component 2 this is 0.9539% of the variation.
Next, to understand what is happening we will look at the eigen vectors
Standard deviations (1, .., p=7): [1] 27.5247652 21.8379751 6.9091408 2.7960560 1.6426283 1.0505273 0.6044305 Rotation (n x k) = (7 x 7): PC1 PC2 PC3 PC4 PC5 PC6 PC7 prince_george_lc8_1 0.2740264 -0.35393469 0.24672391 -0.625697864 0.06229739 -0.05387918 -0.583511215 prince_george_lc8_2 0.2773480 -0.35022900 0.22127703 -0.298588252 0.04854514 -0.22025609 0.781939843 prince_george_lc8_3 0.3511632 -0.30326703 0.19843505 0.337055897 -0.69829000 0.37771787 -0.038087634 prince_george_lc8_4 0.3604207 -0.37716566 0.01484489 0.607506489 0.52145955 -0.24084764 -0.169206504 prince_george_lc8_5 0.5994954 0.71170044 0.35011765 -0.000371049 0.10429454 0.02400220 0.007198983 prince_george_lc8_6 0.4066549 0.07406538 -0.72124920 -0.095773666 -0.31641847 -0.44413104 -0.048991043 prince_george_lc8_7 0.2645867 -0.07643204 -0.45583349 -0.165842375 0.35104902 0.74167205 0.124704801
This table is showing a matrix of each band of input, along with each Principal component, the vectors provided here allow for conversion between the bands and the Eigen channels, showing how much weight each input has on the channel and vice versa.
For example, the eigenvectors are stored in the $rotation variable inside the PCA object, we can reconstruct band 6 with the following code.
image.pca <- pca$x image$pc1 <- image.pca[,1] image$pc2 <- image.pca[,2] image$pc3 <- image.pca[,3] image$pc4 <- image.pca[,4] image$pc5 <- image.pca[,5] image$pc6 <- image.pca[,6] image$pc7 <- image.pca[,7] plot(image$pc1, axes = FALSE) plot(image$pc2, axes = FALSE) plot(image$pc3, axes = FALSE) plot(image$pc4, axes = FALSE) plot(image$pc5, axes = FALSE) plot(image$pc6, axes = FALSE) plot(image$pc7, axes = FALSE) pca$rotation[1, 1] band = 6 plot((image$pc1 * pca$rotation[band, 1] + image$pc2 * pca$rotation[band, 2] + image$pc3 * pca$rotation[band, 3] + image$pc4 * pca$rotation[band, 4] + image$pc5 * pca$rotation[band, 5] + image$pc6 * pca$rotation[band, 6] + image$pc7 * pca$rotation[band, 7] ))
Though we could also get back 99.07% of band 6 with this code
band = 6 plot((image$pc1 * pca$rotation[band, 1] + image$pc2 * pca$rotation[band, 2] + image$pc3 * pca$rotation[band, 3]))
PCA using the Iris dataset
The Iris dataset was used by a Fisher for Taxonomy classification. It includes three iris species with 50 samples each as well as some properties about each flower. The dataset is available in R respostory and is widely used to demonstrate classifation methods.
library(datasets) library(psych) library(dplyr) #Load the data data(iris) #summary stats on the data describe(iris)
In general the variables used in PCA will not be on the same scale yet the results from PCA depend on the scale of the variables. Therefore, each variable is typically centered and scaled to have a mean of zero and standard deviation of one. It is possible, however, that variables are measured in the same units, and one may skip the standardization. Here a couple of the functions are used to scale and center the variables.
# Function for centering a vector center <- function(v) { v - mean(v) } # Function for scaling a vector scale <- function(v) { v / sd(v) } iris_pca_centered <- apply(iris_pca, MARGIN = 2, FUN = center) iris_pca_scaled <- apply(iris_pca_centered, MARGIN = 2, FUN = scale) #compare the descriptive statistics for the scaled data with the original describe(iris_pca) describe(iris_pca_scaled) # Compute eigenvalues and eigenvectors of covariance matrix (= correlation matrix) of our scaled variables iris_pca_cov <- cov(iris_pca_scaled) iris_pca_eigen <- eigen(iris_pca_cov) # Access the eigenvectors rownames(iris_pca_eigen$vectors) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") # label output matrix colnames(iris_pca_eigen$vectors) <- c("PC1", "PC2","PC3", "PC4") # label output matrix iris_pca_eigen$vectors # Access the eigenvalues iris_pca_eigen$values #To compute the proportion of variance explained by each principal component, iris_pca_ve <- iris_pca_eigen$values / sum(iris_pca_eigen$values) iris_pca_ve par(mfrow = c(1, 2), mar = c(4, 5, 3, 1)) plot(iris_pca_ve, xlab = "Principal Component", ylab = "Proportion of Variance Explained", ylim = c(0, 1), type = "b", main = "Scree plot") plot(cumsum(iris_pca_ve), xlab = "Principal Component", ylab = "Cumulative Proportion of\nVariance Explained", ylim = c(0, 1), type = "b", main = "Scree plot") pca_loading <- iris_pca_eigen$vectors #to calculate the scores for the data set Z <- iris_pca_scaled %*% pca_loading Z #Z represents the principle component scores
Assignment
For this assignment, you will be calculated the PCA of a livability index based on four factors from the census, (Education -Secondary…, Education – Postsecondary…., Labour – Did not work, Labout – Worked). The data is available in the L drive with the data table post_college.csv, and the geometry censu_da_subset/bc_da_2016.shp.
# Open post_college census data, livelihood_data <- head(livelihood_data) # COL0 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 # <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #1 59510143 143 70 130 150 170 0 90 255 #2 59510144 144 115 100 90 160 25 180 120 #3 59510148 148 45 95 210 130 15 90 255 #4 59510149 149 35 135 210 165 10 95 285 #select the columns of interest. Please note that COL0 is the DAUID, COL1 is unnecessary (its just another ID that you should ignore) #use COL0 to assign row names (remember the rowname function). For our purposes we can ignore the warning about the rowname function being deprecated. use the view function make sure the rows are labeled #rename the other columns appropiately (remember the colname function) # pca_livelihood <- head(pca_livelihood) # NoEdu YesEdu YesDeg HSGLess30 HSGreater30 Unemp Emp # <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #1 70 130 150 170 0 90 255 #2 115 100 90 160 25 180 120 #3 45 95 210 130 15 90 255 #4 35 135 210 165 10 95 285 # Standardize the data (center and scale) pca_livelihood.centered <- pca_livelihood.scaled <- # Calculate covariance and eigen values pca_livelihood.eigen <- # Calculate the cumulative proportion of variance for each component pca_livelihood.ve <- # Display above results in scree plots # Based on scree plots how many principle components are significant? # Put all the components in the the variable below, and name the rows and columns pca_livelihood.loading <- head(pca_livelihood.loading) # PC1 PC2 … PC7 #NoEdu -0.3416356 0.369629 … 2.26E-01 #YesEdu -0.4165065 -0.1041928 … 3.15E-01 #YesDeg -0.3996926 -0.3336258 … 5.05E-01 #HSLess30 -0.4232683 -0.2421553 … 3.05E-03 #HSGreater30 -0.216916 0.6855366 … 8.20E-05 #Unemp -0.3810779 0.347416 … -2.85E-01 # Use Matrix Multiplication to create the PCA channels pca_livelihood.scores <- head(pca_livelihood.scores) # PC1 PC2 … PC7 #59510143 1.14618 -0.9153936 … 0.025785637 #59510144 0.8983803 1.2864345 … 0.030784241 #59510148 1.4573599 -0.7683981 … 0.02632107 #59510149 0.9583859 -1.2054848 … -0.00041698 #59510151 0.7052388 -1.1805916 … -0.027598662 #59510152 1.1551418 -0.5833508 … -0.001390574 # If you did not set your rows already as shown about, please ret rownames to DAUID to allow merging tables rownames(pca_livelihood.scores) <- rownames(pca_livelihood) # Print livelihood scores pca_livelihood.scores # Use cencus boundaries as simple feature, plot with ggplot2 library(sf) library(ggplot2) library(leaflet) # Filter data on load to avoid kernel crash, filter provided below. # make sure you have the dplyr library(dplyr) is loaded da <- read_sf(dsn = 'censu_da_subset', layer = 'bc_da_2016') %>% filter(DAUID %in% rownames(pca_livelihood.scores) ) # JOIN da and PCA channels DAUID <-rownames(pca_livelihood.scores) pca_livelihood.scores <- cbind(pca_livelihood.scores, DAUID) da_pca <- merge(da, pca_livelihood.scores, by = "DAUID") #ggplot PC1 and PC2, scatter plot showing PC1 on x-axis and y-axis # use geom_vline(xintercept = 0) and # geom_hline(yintercept = 0) to show the vertical and horizontal lines of origins #Leaflet Plot PC1, to five classes #see code below for set up #tm_polygons("PC1", palette = rev(hcl.colors(7, "ag_GrnYl"))) + #tmap_options(max.categories = 5)