The final project is worth 20% of your course mark, 4% for each question.

Question 1

A key element of k-means analysis is choosing an appropriate number of clusters, k. Unless a specific value is required, the analyst tries several values of k and compares the resulting clusterings with respect to the within-sum-of-squares and the between-sum-of-squares.

For this analysis you will use the within-sum-of-squares (elbow method) to find the appropriate number of clusters. As the value of k increases, the within-sum-of-squares keeps decreasing, but beyond some value of k it will stop changing significantly; that value is the elbow.
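The elbow idea can be sketched in a few lines of Python. This is only an illustration, not the required workflow: the function names, the deterministic farthest-first initialization, and the toy points are all assumptions made for the example, and real census variables would normally be standardized first.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans_wss(points, k, max_iter=100):
    """Run Lloyd's k-means and return (labels, within-sum-of-squares)."""
    points = [tuple(p) for p in points]
    centers = farthest_first_centers(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss


def elbow_table(points, k_values):
    """List of (k, within-sum-of-squares) pairs, one per value of k."""
    return [(k, kmeans_wss(points, k)[1]) for k in k_values]


# Toy data only: three tight groups of four points each.
demo_points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

for k, wss in elbow_table(demo_points, range(1, 6)):
    print(k, round(wss, 2))
```

On data like this the within-sum-of-squares drops sharply up to the true number of groups and only slightly afterwards, which is the pattern to look for in your own table and line graph.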

5CCITIES is a 1990 dataset with a selection of variables for several cities in Ohio, US. Little else is known about these cities as of 1990. Your task is to create several groups of cities based on the census-related variables.

The method of analysis will be k-means.

Create a table and a corresponding line graph showing the values of k alongside their respective within-sum-of-squares.


  • A table showing k values alongside the within-sum-of-squares
  • A line graph showing k values alongside the within-sum-of-squares
  • A table showing the groups and the cities within each group.
  • Provide a few sentences of commentary on whether these city groupings make sense, based on your reading of the variables

Question 2

Use FME to add the data for NHL teams and US counties to your PostgreSQL database. Once the data are loaded, determine:

  • the average age of those within 100 mi of each team
  • the total number of people within 100 mi of each team

Submit your SQL code along with your answers.
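The logic behind the two queries can be sketched outside the database. The Python below is a minimal illustration, not the SQL you must submit. It assumes that each county is reduced to a centroid, that the field names (`lat`, `lon`, `population`, `avg_age`) are hypothetical, and that the average age is weighted by county population. The team and county records are toy values.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def summarize_team(team, counties, radius_mi=100.0):
    """Total population and population-weighted average age within radius_mi."""
    nearby = [
        c for c in counties
        if haversine_mi(team["lat"], team["lon"], c["lat"], c["lon"]) <= radius_mi
    ]
    total_pop = sum(c["population"] for c in nearby)
    if total_pop == 0:
        return {"team": team["name"], "population": 0, "avg_age": None}
    avg_age = sum(c["avg_age"] * c["population"] for c in nearby) / total_pop
    return {"team": team["name"], "population": total_pop, "avg_age": avg_age}


# Hypothetical records, for illustration only.
demo_team = {"name": "Team A", "lat": 40.0, "lon": -83.0}
demo_counties = [
    {"lat": 40.5, "lon": -83.0, "population": 1000, "avg_age": 30.0},  # about 35 mi away
    {"lat": 41.0, "lon": -83.0, "population": 3000, "avg_age": 40.0},  # about 69 mi away
    {"lat": 42.0, "lon": -83.0, "population": 5000, "avg_age": 50.0},  # about 138 mi away
]

demo_summary = summarize_team(demo_team, demo_counties)
print(demo_summary)
```

Whether a county counts as within 100 mi by its centroid, by any overlap, or by full containment is a choice you should state alongside your SQL.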

Question 3

You are provided with a dataset of elevation points for a study area in Serbia. The points were derived from a 1:5,000 topographic map. The elevation values are in meters and the horizontal units of the coordinate system are also in meters.

Your first task is to create a DEM from these elevation points. You will need to determine a ‘close-to-optimum’ interpolation method by minimizing the RMSE obtained from cross-validation. A standard rule of thumb for raster resolution is to take the average distance between nearest neighbors, divide it by 2, and round to a sensible number.
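As a rough illustration of both ideas, the Python sketch below runs leave-one-out cross-validation for inverse distance weighting (IDW) at a few candidate powers and applies the cell-size rule of thumb. IDW is used here only as an example interpolator, not as the recommended method; the function names, candidate powers, and the toy 3 x 3 grid of points are assumptions.

```python
import math


def idw_predict(x, y, samples, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (x, y, z) samples."""
    num = den = 0.0
    for sx, sy, sz in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return sz  # exactly on a sample point
        w = 1.0 / d ** power
        num += w * sz
        den += w
    return num / den


def loocv_rmse(samples, power):
    """Leave-one-out cross-validation RMSE for IDW with the given power."""
    sq_errors = []
    for i, (x, y, z) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        sq_errors.append((idw_predict(x, y, others, power) - z) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def suggested_cell_size(samples):
    """Rule of thumb: mean nearest-neighbor distance divided by 2 (before rounding)."""
    nearest = []
    for i, (x, y, _) in enumerate(samples):
        nearest.append(
            min(
                math.hypot(x - ox, y - oy)
                for j, (ox, oy, _) in enumerate(samples)
                if j != i
            )
        )
    return (sum(nearest) / len(nearest)) / 2.0


# Toy points on a 10 m grid with a made-up planar surface.
demo_samples = [
    (float(x), float(y), 100.0 + x + 0.5 * y)
    for x in (0, 10, 20)
    for y in (0, 10, 20)
]

candidate_powers = (1.0, 2.0, 3.0)
rmse_by_power = {p: loocv_rmse(demo_samples, p) for p in candidate_powers}
best_power = min(rmse_by_power, key=rmse_by_power.get)

print(rmse_by_power)
print("lowest RMSE at power", best_power)
print("suggested cell size (m):", suggested_cell_size(demo_samples))
```

The same loop structure applies to any interpolator and parameter set you want to compare: compute the cross-validation RMSE for each candidate and justify your choice from the resulting table.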


  1. A clear description of your analysis steps including a justification for your selected method to create a DEM.

You are also provided with a table of control points collected using high-accuracy GPS units.

Use the control points to determine the accuracy of the DEM.

  1. A clear description of your analysis steps and the results of your accuracy assessment.
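One way to summarize the accuracy assessment is sketched below in Python. It assumes you have already sampled the DEM at each control point location, so that the two lists line up one to one; the function name and the toy elevations are hypothetical.

```python
import math


def accuracy_summary(dem_values, gps_values):
    """Error statistics for DEM elevations against control-point elevations."""
    if len(dem_values) != len(gps_values) or not dem_values:
        raise ValueError("need two non-empty lists of equal length")
    errors = [d - g for d, g in zip(dem_values, gps_values)]  # DEM minus GPS
    n = len(errors)
    return {
        "n": n,
        "mean_error": sum(errors) / n,              # bias: positive if the DEM is too high
        "mae": sum(abs(e) for e in errors) / n,     # mean absolute error
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }


# Toy values in meters, for illustration only.
demo_dem = [101.0, 99.0, 103.0, 97.0]
demo_gps = [100.0, 100.0, 100.0, 100.0]

demo_accuracy = accuracy_summary(demo_dem, demo_gps)
print(demo_accuracy)
```

Reporting the mean error next to the RMSE helps separate a systematic offset in the DEM from random scatter.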

Question 4

You are provided with a shapefile of watersheds in Iowa. The watersheds of interest can be identified by the field CU or NAME. You are also provided with a table of rainfall data (in inches) at monitoring stations for the years 2000, 2001, 2002, 2003 and 2004.

Your task is to develop your best estimate of the average rainfall for each of the CU/NAME watersheds in 2000 and 2004.

Then determine for each watershed the change (+/- %) in average rainfall from 2000 to 2004 using 2000 as the base year.
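The percentage change with 2000 as the base year is (rainfall 2004 - rainfall 2000) / rainfall 2000 x 100. A minimal Python sketch follows; the watershed names and rainfall values are made up.

```python
def percent_change(base, new):
    """Signed percentage change from base to new, relative to base."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return (new - base) / base * 100.0


# Hypothetical watershed means in inches: (2000, 2004).
demo_rainfall = {"Watershed A": (30.0, 33.0), "Watershed B": (40.0, 36.0)}

demo_change = {
    name: percent_change(r2000, r2004)
    for name, (r2000, r2004) in demo_rainfall.items()
}
print(demo_change)
```

A positive value means 2004 was wetter than 2000 for that watershed, and a negative value means it was drier.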

Create a meaningful thematic map of your result.

Determine the average elevation for each watershed. Is there a dependence relationship between the average elevation and the rainfall for the year 2000? Show your analysis steps.

Hint: Zonal statistics in QGIS and ArcGIS. What is the equivalent in R? (Force R to use terra by prefixing the function name with terra::)
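Once each watershed has an average elevation and an average rainfall for 2000, one simple way to examine dependence is a Pearson correlation together with a least-squares line. The Python sketch below assumes a linear relationship is a reasonable first check; the function names and the four toy watersheds are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def linear_fit(xs, ys):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy values: mean elevation (m) and mean 2000 rainfall (in) per watershed.
demo_elevation = [200.0, 250.0, 300.0, 350.0]
demo_rain_2000 = [30.0, 32.0, 34.0, 36.0]

demo_r = pearson_r(demo_elevation, demo_rain_2000)
demo_slope, demo_intercept = linear_fit(demo_elevation, demo_rain_2000)
print(demo_r, demo_slope, demo_intercept)
```

A scatter plot of the two variables is worth including as well, since a correlation coefficient alone can hide a non-linear pattern.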

Question 5

You are provided with a dataset showing a handful of variables for counties across the mainland US. Your task is to determine a composite variable that can serve as an index for health. The variables of interest for this analysis are:

Pop2010: Population in 2010

Pop2010_Sq: Population Density

Pct_Obese: Percentage of population classified as obese

PctPhysIna: Percentage of population that saw a physician in a hospital

PctBlack: Percentage of population – Black

PctWhite: Percentage of population – white

PctMale: Percentage of population – male

PctFemale: Percentage of population – female
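One simple way to build a composite variable is to standardize each input as a z-score and combine the z-scores with signed weights. The Python sketch below shows only that mechanism. The choice of variables, the weights, and the three toy counties are hypothetical. Deciding which variables to include and in which direction is the substance of the question, and a data-driven method such as principal component analysis is a common alternative for deriving the weights.

```python
import math


def zscores(values):
    """Standardize a list using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0.0:
        return [0.0] * n  # a constant variable carries no information
    return [(v - mean) / sd for v in values]


def composite_index(rows, weights):
    """Weighted sum of z-scores, one value per row.

    rows    -- list of dicts, one per county
    weights -- dict mapping variable name to a signed weight
    """
    columns = {name: zscores([row[name] for row in rows]) for name in weights}
    return [
        sum(weights[name] * columns[name][i] for name in weights)
        for i in range(len(rows))
    ]


# Hypothetical counties and weights, for illustration only.
demo_rows = [
    {"Pct_Obese": 20.0, "PctPhysIna": 10.0},
    {"Pct_Obese": 30.0, "PctPhysIna": 20.0},
    {"Pct_Obese": 40.0, "PctPhysIna": 30.0},
]
demo_weights = {"Pct_Obese": -1.0, "PctPhysIna": -1.0}

demo_index = composite_index(demo_rows, demo_weights)
print(demo_index)
```

Standardizing first keeps a variable with a large numeric range, such as Pop2010, from dominating the index simply because of its units.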
