Basic Analytics in R

data.frames and tibbles

When working with tabular data, each row generally represents a record of the phenomenon of interest (such as a spatial object or a person) and each column represents an attribute of that record. In R, data can be stored in a matrix, but a matrix can hold only one data type (e.g. integer, logical, character). The data.frame and tibble structures can hold a different data type in each column. A tibble is a refinement of the data.frame: it keeps the same column-based structure but behaves more strictly, for example by not partially matching column names and by printing only the first rows of a large table.
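To make the one-type restriction concrete, here is a small base R sketch. It reuses two of the cities from this lab; the object names m and df_small are only for illustration.

```r
# A matrix holds a single type, so mixing numbers and text
# coerces every value to character
m <- cbind(pop2021 = c(76708, 9889), city = c("Prince George", "Quesnel"))
matrix_type <- class(m[, "pop2021"])
print(matrix_type)   # "character"

# A data.frame keeps a separate type for each column
df_small <- data.frame(pop2021 = c(76708, 9889),
                       city = c("Prince George", "Quesnel"))
frame_type <- class(df_small$pop2021)
print(frame_type)    # "numeric"
```

Because the population values in the matrix are now text, arithmetic such as sum() would fail on them until they are converted back to numbers.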

Try: 

library(tibble)  # provides tibble(); also attached by library(tidyverse)
df <- data.frame(pop2021 = c(76708, 9889, 93203, 662248, 144576), city = c("Prince George", "Quesnel", "Chilliwack", "Vancouver", "Kelowna"))
tb <- tibble(pop2021 = c(76708, 9889, 93203, 662248, 144576), city = c("Prince George", "Quesnel", "Chilliwack", "Vancouver", "Kelowna"))

df$ci

tb$ci

Compare the two outputs above. Do you see how a data.frame can be problematic?
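The difference comes from partial matching: on a data.frame, the $ operator accepts any unambiguous prefix of a column name. A minimal base R sketch of the behaviour, and of one way to avoid it, follows. The variable names partial and exact are only for illustration.

```r
# Same toy data as above, rebuilt so the example is self-contained
df <- data.frame(
  pop2021 = c(76708, 9889, 93203, 662248, 144576),
  city = c("Prince George", "Quesnel", "Chilliwack", "Vancouver", "Kelowna")
)

# `$` on a data.frame partially matches column names:
# "ci" silently resolves to "city"
partial <- df$ci
print(partial)

# `[[` matches names exactly by default, so the misspelled
# name returns NULL instead of a column
exact <- df[["ci"]]
print(is.null(exact))
```

If you need to keep using data.frames, setting options(warnPartialMatchDollar = TRUE) makes R warn whenever $ falls back on a partial match.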

Working with data files

Install the tidyverse package (once, with install.packages("tidyverse")) and load it in your script. Load the pgTrees.csv file with the read_csv() function, which is part of the readr package in the tidyverse. If a View() call appears in the console, it was added by RStudio to show your tibble in the data viewer.

library(tidyverse)
pgTrees <- read_csv("pgTrees.csv")

How many trees are in the table? How many variables are there? Apply the nrow() and ncol() functions to the data.
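As a sketch of what these functions return, the example below uses a small made-up table in place of pgTrees. The values are invented and toy_trees is a hypothetical name. Only the column names follow the ones used later in this lab.

```r
# Made-up stand-in for the pgTrees table (three records, three variables)
toy_trees <- data.frame(
  Location = c("Cemetery", "Cemetery", "Connaught Hill Park"),
  DBH = c(32.5, 18.0, 51.2),
  TreeAge = c(40, 22, 65)
)

n_records <- nrow(toy_trees)    # number of rows (records)
n_variables <- ncol(toy_trees)  # number of columns (variables)
print(c(n_records, n_variables))

# dim() returns both at once: rows first, then columns
print(dim(toy_trees))
```

The same calls work unchanged on a tibble returned by read_csv().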

Next, use R’s built-in functions to do some basic plotting.

# using the basic function
plot(pgTrees$Eastings, pgTrees$Northings)

# specify the aspect ratio, point colour and symbol, and axis labels
plot(pgTrees$Eastings, pgTrees$Northings, asp=1, col='red', pch=16, xlab="Easting", ylab="Northing")

#the title is added after the plot function is called
title('Trees in PG')

R’s built-in plot functions and their parameters are specific to the type of data passed to them. ggplot2 is a package that builds a plot from separate layers (data, aesthetic mappings, geometries, labels, themes), so each part of the plot can be handled on its own. If you have installed tidyverse, ggplot2 is included.

#Using ggplot2
ggplot(pgTrees, aes(x = Eastings, y = Northings)) +
  geom_point()
  
# To add labels and fix the scale
ggplot(pgTrees, aes(x = Eastings, y = Northings)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "GGPLOT", x = "Eastings", y = "Northings") +
  theme_minimal() +
  coord_fixed()

# if you wanted to fit a trend line (these data are not )
ggplot(pgTrees, mapping = aes(x = DBH, y = TreeAge)) +
  geom_point() +
  geom_smooth(method = 'lm', col='red', fill = "lightsalmon") +
  labs(title = "GGPLOT", x = "DBH", y = "Tree Age") +
  theme_minimal() +
  coord_fixed()
  

#create a histogram. try different binwidths: 2,4,8,16,32
ggplot(pgTrees, aes(DBH)) +
  geom_histogram(binwidth = 4)

  
# You can also plot a density curve. Try this with and without the histogram
# (after_stat(density) replaces the deprecated ..density.. notation)
ggplot(pgTrees, aes(DBH)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white", binwidth = 0.5) +
  geom_density(alpha = .2, fill = "#FF6666")
  

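The dplyr package provides a set of functions for manipulating data, such as filter() to select rows, group_by() to define groups, summarize() to compute statistics, and mutate() to add columns.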

# find the average DBH in certain locations
# (filter first so that only the rows of interest are grouped)
pgTrees %>%
  dplyr::filter(Location %in% c('Cemetery', 'Connaught Hill Park')) %>%
  group_by(Location) %>%
  dplyr::summarize(Mean = mean(DBH, na.rm = TRUE))
      
# find the standard deviation
pgTrees %>% 
  dplyr::summarize(SD = sd(DBH,na.rm = TRUE))
      
      
      
# find the ratio of tree age to DBH, stored in a new column
pgTrees2 <- pgTrees %>%
  mutate(TreeAge_per_DBH = TreeAge / DBH)
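Several statistics can be computed per group in a single summarize() call. The sketch below assumes dplyr is installed and uses made-up DBH values. The names toy_trees and dbh_summary are hypothetical, and only the column names follow the pgTrees examples above.

```r
library(dplyr)

# Made-up values; one DBH measurement is deliberately missing
toy_trees <- data.frame(
  Location = c("Cemetery", "Cemetery", "Connaught Hill Park",
               "Connaught Hill Park", "Other"),
  DBH = c(30, 50, 20, NA, 10)
)

dbh_summary <- toy_trees %>%
  filter(Location %in% c("Cemetery", "Connaught Hill Park")) %>%
  group_by(Location) %>%
  summarize(
    n = n(),                        # rows per group, including rows with NA
    Mean = mean(DBH, na.rm = TRUE), # mean of the non-missing values
    SD = sd(DBH, na.rm = TRUE)      # NA when only one value remains
  )

print(dbh_summary)
```

Note that n() counts every row in the group, so a group's count can be larger than the number of values that actually entered the mean.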

Assignment (complete and send a PDF file with your answers):

Using the Sacramento real estate sales data:

  1. Using ggplot, create a scatter plot of the locations.
  2. Calculate the total sales during the study period.
  3. Calculate the average price in North Highlands, Elk Grove, and Sacramento.
  4. Graphically show the relationship between the number of bedrooms and the sale price for the whole data set.
  5. What are the coordinates of the weighted mean center, based on price, for this data set?
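For item 5, the weighted mean center is the pair of weighted averages of the x and y coordinates, where each point's weight is its attribute value: sum(w * x) / sum(w) and sum(w * y) / sum(w). The sketch below illustrates the calculation on three made-up points. The column names x, y and w are hypothetical and are not the column names of the Sacramento file.

```r
# Made-up points; w plays the role of the weighting attribute
pts <- data.frame(
  x = c(0, 10, 10),
  y = c(0, 0, 10),
  w = c(1, 1, 2)
)

# Weighted mean center: weighted average of each coordinate
wmc_x <- weighted.mean(pts$x, pts$w)
wmc_y <- weighted.mean(pts$y, pts$w)

# Same result written out from the formula
wmc_x_manual <- sum(pts$w * pts$x) / sum(pts$w)

print(c(x = wmc_x, y = wmc_y))
```

The point with the largest weight pulls the center toward itself, which is why the result differs from the unweighted mean of the coordinates.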