- A second look at the postal code and car thefts data
- Moving data from stats can into excel
- cleaning data in wordpad
- joining data in QGIS
Joining Data Properly
Review the Postal Code data
Below is a snippet of the postal code data we worked with last tutorial:
Below is a snippet of the thefts data:
|Postal Code||Address||Street||Location Description||Incident Month||Incident Date||Incident Year||Incident Time||Vehicle Make||Vehicle Model||Vehicle Year|
|V2N2N3||6825||SIMON FRASER AVE||N/A||AUGUST||1||1999||11.00.00||JEEP||CHERO||1988|
|V2L3X3||1600||15TH AVE||PARKWOOD PLACE||AUGUST||8||1999||14.30.00||TOYOTA||L CSR||1987|
|V2L3X3||1600||15TH AVE||PARKWOOD PLACE||AUGUST||4||1999||21.30.00||PONTIAC||FRFLY||1991|
|V2N5A3||7038||ST. ANTHONY CRES||N/A||AUGUST||3||1999||4-30-2000||DODGE||4WHDR||1998|
If we hope that the data we are working with should have Unique Identifiers (the UID we looked at in lecture this week), we are in big trouble with this data. Last week we joined the two tables using the Postal Codes as the joining field or column. Lets go through the steps as a class and discuss what problems occur and why we joined the tables in the way (direction) we did – TAKE NOTES
Questions to consider for a better understanding:
Do we lose any data when we join in this direction?
Would we lose data if we joined in the other direction?
Why should the joining be done in the direction it was
Do we get duplicate points when we create a thefts point layer?
Do we lose any data creating the thefts point layer?
A quick way to add Unique Identifiers in QGIS
Open up either the postal file you created last week (or use /home/labs/geog204/tut1/postal_2004_clean.txt) into QGIS (do you recall how to do so?)
Once the table is in QGIS:
- Convert it to a shapefile –> right click–> save as –> choose ESRI Shapefile –> save to your workspace for the tutorial
- Open the attribute table –> open field calculator –> new field (call it uid perhaps) –> enter the “@row_number” function in the expression panel –> Click OK
Below is a screenshot of the settings used for the steps above:
Downloading data from Stats Can – Census Analyzer (detailed steps for data without fuld DAUID values in the dataset)
We are going to continue using the Canadian Census Analyzer for acquiring data. In this case we are looking for census information – so we must think about which census boundary layer is most applicable. We also are going to make it harder on ourselves by purposely downloading incomplete data sets and then fixing them. We are doing this to to develop skills and understanding for the manipulation of data. In earlier years Stats Can was not as wise in how they supplied data to us users, so as students before you suffered to clean up this data – so will we.
Scott will lead the class in downloading data from the census analyzer to obtain the following data for Dissemination Areas for 2006:
- Total Population
- Total Male Population
- Total Female Population
- The number of occupied private dwellings from 1991 to 2006
We want the data in html format and we also want the data to have the CD code with it (we do not want the DAUID with it – that would be too easy!). PLEASE TAKE NOTES
Cleaning html data in Excel and Wordpad
The downloaded data should be saved into a logical location (proper directory). These data are from Stats Can containing census information for Dissemination Areas. We used the census analyzer to download information for 2006 in html format. You could have obtained any population information you wanted, but you must have included the CD Code as one of the options (not the complete Dissemination Area unique identifier – DAUID).
Cleanup for bringing into QGIS – EXCEL
Open the data in Excel.
Take out any unwanted rows (records) of data
Combine the provcd/da columns to make a DAUID column.
- this is achieved by placing the columns in the correct order (prov-cd| DA).
- make sure that the values are represented in the proper amount of digits per column
(i.e. DA 8 should be 0008).
- securing the proper number of digits is performed by formatting the cells of a whole column
(highlight row –> right click –> format cells –> custom –> type in 0000 in the box).
- insert an empty column between the prov-cd column and the DA columns.
- save as column delimited (CSV comma separated) and open it in wordpad.
- In wordpad click Edit –> Replace ,, with nothing.
- Also change the name of the item names in this file to more suitable names
Cleanup for bringing into QGIS – Libre Office
The same process for cleaning data in EXCEL with one extra import procedure:
- Open Libre Office
- Inset –> Link to external data –> Select the html file –> select auto language –> HTML_all
- The rest is the same as above except the format cell is “user defined” rather than “custom”
The data can now be brought into QGIS BUT YOU CAN DO MORE
Question – for those that prefer to do all of this work in Excel, can you think of another way to concatenate values (combine values from two cells into one). Is there a method for concatenating in QGIS as well?
Cleaning and joining data in QGIS
We can change the data types in QGIS from string to numeric.
Open the dabc_2006.shp layer from tut4 directory in the geog204 labs directory. Add the csv file you just cleaned above to your project. If you are not happy with your data – use the once that Scott just cleaned up (/home/labs/geog204/tut4/census_tut4.csv/). Now save the statscan table as a shape file in a logically name directory in your home directory. How may files are created for this shape file – why?
Open this new statscan shape file then:
- open attribute table
- toggle editing on
- new column
- name it popnum
- whole number integer with width 10
Now we have an extra column in our table that we can use to add data to. In our case we want to transform (cast) the data from one column that is a text (or string) value to a column that holds numeric (integer) values. We do this by:
- Hitting the field calculator in the attribute table box
- Click on Update existing field – use “popnum” (the one we just created)
- Click on “to int” in the Operators panel
- Add the “totpop” (or what ever you called your total population) by clicking on it in the Fields panel
- Add a closing bracket
- Hit OK
- Save your edits
Now you have a clean and numeric table to work with.
Once the fields are created, we can join the statscan table to the dabc table. Can you remember how to do this?
Cleaning the CSV to ensure proper data type import to QGIS – THE EASIER WAY
If you open the data directly into QGIS, you will notice that the population data is in a string format (Character or text) rather than in numeric. For the most part, QGIS will allow you to work with the data as if it was a number – but for many calculations you should have the data as numbers. There is a quick way to ensure the proper import of the data by:
- Copy the first row from the csv file into a new file in a text editor (such as gedit)
- replace each column (before each comma) with the data type each column is surrounded by double quotes
- save as the same file name except at a “t” at the end of the csv extention to be .csvt
Here is the examples from today’s data set – census_tut4.csv
Take the top row to a new file
Replace the top row with the data type
Save as csvt file – census_tut4.csvt
Now the data will be loaded into QGIS with the proper data types
If you reload the cleaned csv file into libre office – you can then save it as a dbf file. It will retain the type of data for each column as well as be editable when loaded into QGIS. This does not always work however, and be careful not to name the dbf file the same as any other shape files.
Resetting the names of the attributes for joined tables
You noticed that because of the 9 or 10 character limit for column headers in a dbf file, your new layer has truncated header labels with a number stuck on the end. Longer names can occur if you do not pay attention to the join prefix option when joining tables (as the software adds the name of the join table file to each cell). This sucks and is only with QGIS 2 onward. To fid this; load the table manger plugin into your project and rename the columns to what you would like.
Couple of questions – thoughts
Why did we set the column width to 10 when we made a column to hold the DANUM (Dissemination Area Ids)?
Did we need to change the values from “text/string/character” to join the tables?
Just for fun – bring the layer into Google Earth
It is easy to take the layer we just created into google earth. We are not going to work on bringing in any styling just the data. In order to do so, just save the layer as if you were saving to a shape file but choose kml instead – then open it in Google Earth.
Census data from 2016 – the easy way
We are going to pull some census data using the Census Analyzer for 2016, but this time we will ask to have the DAUID as part of the download.
Load up the Census Analyzer and navigate to Dissemination Areas for 2016
- Open the Library Pages –> Collections –> Data and Statistics –> Canadian Data –> Census Analyzer
- Enumeration area / Dissemination area –> 2016 –> Fraser Fort George
- Select out the following population statistics:
- Population, 2016 (v1)
- Land area in square kilometres
- Total – Population aged 15 years and over that are employed (this is long form data)
- Select output as html –> save as html page only
- Bring it into Excel or LibreOffice
- Save as a clean csv file
- open in QGIS
Check out the areas from stats can – for fun (not required)
You can open up the 2016 DA boundary files (/home/labs/geog204/boundary_files/2016/da_cartographic/dabc_cartographic_lat-lon.shp) and join the data to the DAs in the Fraser Fort George. Once you have attached the data from your download, save the polygons that have the census values (FFG polygons) to you own folder in UTM 10 projection (EPSG:26910).
Then calculate the area of the polygons by using calculate with the following expresssion:
See how your values compare with Stats Can.
Check out Ujaval Gandhi’s site for working with QGIS – really well done: