Principal Component Analysis

Eigenvectors, Eigenvalues, Identity Matrices

It is helpful to have some background in linear algebra. For our purposes, though, we’ll start with a remote sensing example.

A visual example

To start our exploration of PCA we will first look at the principal components of an image as it helps to intuitively show what the numbers mean, start by opening a Landsat image and plotting the image in R Studio.

# Load Libraries
library(raster)
library(dplyr)
library(imager)

# Open the image
# On windows path is 'L:/GEOG413....'
image = stretch(stack('prince_george_lc8.tif'))

#In landsat 8 images, red is band 4, green 3, and blue 2.
plotRGB(image, r = 4, g = 3, b = 2)

#plot all 7 bands. 
plot(image)

At this point, you should be able to see an image of Prince George in 7 different wavelengths as captured by Landsat 8. Before the image can have the PCA calculated it must be converted to a data frame with only 2 dimensions (currently we have x, y, and band, we need xy, and band).

dim(image)
image.df <- as.data.frame(image)
dim(image.df)

#If your dimensions are reduced you can now compute the Principal Components of the image using the dplyr library

pca <- prcomp(image.df)

#look at both a summary of the results, as well as the eigenvalues produced.
summary(pca)
pca

First, take a look at the output of the summary

Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     27.5248 21.8380 6.90914 2.79606 1.64263 1.05053 0.60443
Proportion of Variance  0.5854  0.3685 0.03688 0.00604 0.00208 0.00085 0.00028
Cumulative Proportion   0.5854  0.9539 0.99074 0.99678 0.99887 0.99972 1.00000

In this table we can see there are 7 principal components across the top (why are there 7?). They are shown along with the Standard Deviation of pixel intensities, the Proportion of Variance represented, and the Cumulative Proportion of variance. This is telling us, that Principal Component 1 represents 0.5854 of the variation in pixel intensities, and combined with Principal Component 2 this is 0.9539% of the variation. 

Next, to understand what is happening we will look at the eigen vectors

Standard deviations (1, .., p=7):
[1] 27.5247652 21.8379751  6.9091408  2.7960560  1.6426283  1.0505273  0.6044305

Rotation (n x k) = (7 x 7):
                          PC1         PC2         PC3          PC4         PC5         PC6          PC7
prince_george_lc8_1 0.2740264 -0.35393469  0.24672391 -0.625697864  0.06229739 -0.05387918 -0.583511215
prince_george_lc8_2 0.2773480 -0.35022900  0.22127703 -0.298588252  0.04854514 -0.22025609  0.781939843
prince_george_lc8_3 0.3511632 -0.30326703  0.19843505  0.337055897 -0.69829000  0.37771787 -0.038087634
prince_george_lc8_4 0.3604207 -0.37716566  0.01484489  0.607506489  0.52145955 -0.24084764 -0.169206504
prince_george_lc8_5 0.5994954  0.71170044  0.35011765 -0.000371049  0.10429454  0.02400220  0.007198983
prince_george_lc8_6 0.4066549  0.07406538 -0.72124920 -0.095773666 -0.31641847 -0.44413104 -0.048991043
prince_george_lc8_7 0.2645867 -0.07643204 -0.45583349 -0.165842375  0.35104902  0.74167205  0.124704801

This table is showing a matrix of each band of input, along with each Principal component, the vectors provided here allow for conversion between the bands and the Eigen channels, showing how much weight each input has on the channel and vice versa. 

For example, the eigenvectors are stored in the $rotation variable inside the PCA object, we can reconstruct band 6 with the following code.

image.pca <- pca$x
image$pc1 <- image.pca[,1]
image$pc2 <- image.pca[,2]
image$pc3 <- image.pca[,3]
image$pc4 <- image.pca[,4]
image$pc5 <- image.pca[,5]
image$pc6 <- image.pca[,6]
image$pc7 <- image.pca[,7]


plot(image$pc1, axes = FALSE)
plot(image$pc2, axes = FALSE)
plot(image$pc3, axes = FALSE)
plot(image$pc4, axes = FALSE)
plot(image$pc5, axes = FALSE)
plot(image$pc6, axes = FALSE)
plot(image$pc7, axes = FALSE)

pca$rotation[1, 1]

band  =  6
plot((image$pc1 * pca$rotation[band, 1] + image$pc2 * pca$rotation[band, 2] + image$pc3 * pca$rotation[band, 3] + 
        image$pc4 * pca$rotation[band, 4] + image$pc5 * pca$rotation[band, 5] + image$pc6 * pca$rotation[band, 6] +
        image$pc7 * pca$rotation[band, 7] ))

Though we could also get back 99.07% of band 6 with this code

band  =  6
plot((image$pc1 * pca$rotation[band, 1] + image$pc2 * pca$rotation[band, 2] + image$pc3 * pca$rotation[band, 3]))

PCA using the Iris dataset

The Iris dataset was used by a Fisher for Taxonomy classification. It includes three iris species with 50 samples each as well as some properties about each flower. The dataset is available in R respostory and is widely used to demonstrate classifation methods.

library(datasets)
library(psych)
library(dplyr)

#Load the data
data(iris)

#summary stats on the data
describe(iris)

In general the variables used in PCA will not be on the same scale yet the results from PCA depend on the scale of the variables. Therefore, each variable is typically centered and scaled to have a mean of zero and standard deviation of one. It is possible, however, that variables are measured in the same units, and one may skip the standardization. Here a couple of the functions are used to scale and center the variables.

# Function for centering a vector
center <- function(v) {
  v - mean(v)
}
# Function for scaling a vector
scale <- function(v) {
  v / sd(v)
}

iris_pca_centered <- apply(iris_pca, MARGIN = 2, FUN = center)
iris_pca_scaled <- apply(iris_pca_centered, MARGIN = 2, FUN = scale)

#compare the descriptive statistics for the scaled data with the original
describe(iris_pca)
describe(iris_pca_scaled)


# Compute eigenvalues and eigenvectors of covariance matrix (= correlation matrix) of our scaled variables
iris_pca_cov <- cov(iris_pca_scaled)
iris_pca_eigen <- eigen(iris_pca_cov)

# Access the eigenvectors
rownames(iris_pca_eigen$vectors) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") # label output matrix
colnames(iris_pca_eigen$vectors) <- c("PC1", "PC2","PC3", "PC4") # label output matrix
iris_pca_eigen$vectors

# Access the eigenvalues
iris_pca_eigen$values

#To compute the proportion of variance explained by each principal component,
iris_pca_ve <- iris_pca_eigen$values / sum(iris_pca_eigen$values)
iris_pca_ve

par(mfrow = c(1, 2), mar = c(4, 5, 3, 1))
plot(iris_pca_ve,
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b",
     main = "Scree plot")

plot(cumsum(iris_pca_ve),
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of\nVariance Explained",
     ylim = c(0, 1),
     type = "b",
     main = "Scree plot")


pca_loading <- iris_pca_eigen$vectors
#to calculate the scores for the data set
Z <- iris_pca_scaled %*% pca_loading
Z
#Z represents the principle component scores

Assignment

For this assignment, you will be calculated the PCA of a livability index based on four factors from the census, (Education -Secondary…, Education – Postsecondary…., Labour – Did not work, Labout – Worked). The data is available in the L drive with the data table post_college.csv, and the geometry censu_da_subset/bc_da_2016.shp.

# Open post_college census data, 
livelihood_data <-

head(livelihood_data)
#      COL0  COL1  COL2  COL3  COL4  COL5  COL6  COL7  COL8
#     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 59510143   143    70   130   150   170     0    90   255
#2 59510144   144   115   100    90   160    25   180   120
#3 59510148   148    45    95   210   130    15    90   255
#4 59510149   149    35   135   210   165    10    95   285

#select the columns of interest. Please note that COL0 is the DAUID, COL1 is unnecessary (its just another ID that you should ignore)

#use COL0 to assign row names (remember the rowname function). For our purposes we can ignore the warning about the rowname function being deprecated. use the view function make sure the rows are labeled

#rename the other columns appropiately (remember the colname function)

#
pca_livelihood <-
head(pca_livelihood)
#  NoEdu YesEdu YesDeg HSGLess30 HSGreater30 Unemp   Emp
#  <dbl>  <dbl>  <dbl>     <dbl>       <dbl> <dbl> <dbl>
#1    70    130    150       170           0    90   255
#2   115    100     90       160          25   180   120
#3    45     95    210       130          15    90   255
#4    35    135    210       165          10    95   285

# Standardize the data (center and scale)
pca_livelihood.centered <-
pca_livelihood.scaled <-

# Calculate covariance and eigen values
pca_livelihood.eigen <- 

# Calculate the cumulative proportion of variance for each component
pca_livelihood.ve <- 

# Display above results in scree plots


# Based on scree plots how many principle components are significant?

# Put all the components in the the variable below, and name the rows and columns
pca_livelihood.loading  <-

head(pca_livelihood.loading)
#				PC1			PC2			…	PC7
#NoEdu			-0.3416356	0.369629	…	2.26E-01
#YesEdu			-0.4165065	-0.1041928	…	3.15E-01
#YesDeg			-0.3996926	-0.3336258	…	5.05E-01
#HSLess30		-0.4232683	-0.2421553	…	3.05E-03
#HSGreater30	-0.216916	0.6855366	…	8.20E-05
#Unemp			-0.3810779	0.347416	…	-2.85E-01

# Use Matrix Multiplication to create the PCA channels
pca_livelihood.scores <-


head(pca_livelihood.scores)
#			PC1			PC2			…	PC7
#59510143	1.14618		-0.9153936	…	0.025785637
#59510144	0.8983803	1.2864345	…	0.030784241
#59510148	1.4573599	-0.7683981	…	0.02632107
#59510149	0.9583859	-1.2054848	…	-0.00041698
#59510151	0.7052388	-1.1805916	…	-0.027598662
#59510152	1.1551418	-0.5833508	…	-0.001390574


# If you did not set your rows already as shown about, please ret rownames to DAUID to allow merging tables
rownames(pca_livelihood.scores) <- rownames(pca_livelihood)

# Print livelihood scores
pca_livelihood.scores 



# Use cencus boundaries as simple feature, plot with ggplot2
library(sf)
library(ggplot2)
library(leaflet)

# Filter data on load to avoid kernel crash, filter provided below. 
# make sure you have the dplyr library(dplyr) is loaded
da <- read_sf(dsn = 'censu_da_subset', layer = 'bc_da_2016') %>% 
  filter(DAUID %in% rownames(pca_livelihood.scores) )

# JOIN da and PCA channels
DAUID <-rownames(pca_livelihood.scores)
pca_livelihood.scores <- cbind(pca_livelihood.scores, DAUID)
da_pca <- merge(da, pca_livelihood.scores, by = "DAUID")
 

#ggplot PC1 and PC2, scatter plot showing PC1 on x-axis and y-axis
# use geom_vline(xintercept = 0) and 
# geom_hline(yintercept = 0) to show the vertical and horizontal lines of origins


#Leaflet Plot PC1, to five classes
#see code below for set up
  #tm_polygons("PC1", palette = rev(hcl.colors(7, "ag_GrnYl"))) +
  #tmap_options(max.categories = 5)