In today's lab we are going to look at optimizing our databases for efficiency, in terms of storage space, performance, and truth. We will do this through a process called normalization.
Keys & Normalized form
Normalized form is a large topic that could easily be its own course; we are going to look at only the most basic forms. We will be guided by the following principles:
- Don’t repeat yourself. Storing the same data in multiple places wastes disk space and can lead to multiple versions of the truth.
- Finding information should be easy. We want to keep our keys as small as they can be while still being unique, and to access any piece of data we want to join as few tables as possible.
As an example, let's say we have a client database where a person may have multiple contact methods. Our table may look something like this:
| Name | Address | Date of Birth | Email | Phone |
| --- | --- | --- | --- | --- |
| John Doe | 3333 – University Way | Jan 1, 1970 | John@example.ca | 555-1234 |
| John Doe | 3333 – University Way | Jan 1, 1970 | John@example.ca | 555-5555 |
| Jane Doe | 3333 – University Way | Jan 2, 1970 | Jane@example.ca | 555-1234 |
| John Doe | 3333 – University Way | Jan 1, 1970 | John@tempuri.com | 555-5555 |
This table has many issues. John Doe has his address stored 3 times, 2 email addresses, and 2 phone numbers spread across 4 records. If he moves, we need to update his address 3 times, and looking up his email or phone number is needlessly complex. If we use Name as a primary key, we can make a structure that looks more like this:
| Name | Address | Date of Birth |
| --- | --- | --- |
| John Doe | 3333 – University Way | Jan 1, 1970 |
| Jane Doe | 3333 – University Way | Jan 2, 1970 |
In the example above, it may not look like we have less data, since the name (the primary key) is duplicated many times across the contact tables, but we have avoided repeating e-mails and phone numbers, and we have made it much easier to update the address.
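The split into one person table plus separate contact tables can be sketched in a few lines of Python. This is illustrative only: the labs use PostgreSQL, but SQLite's in-memory database is enough to show the structure, and all table and column names here are made up for the example.

```python
import sqlite3

# Illustrative sketch only: the labs use PostgreSQL, but SQLite's
# in-memory database is enough to show the normalized structure.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per person; name acts as the primary key.
cur.execute("CREATE TABLE person (name TEXT PRIMARY KEY, address TEXT, dob TEXT)")
# Each contact method lives in its own table, keyed back to person,
# so no email or phone number is ever stored twice.
cur.execute("CREATE TABLE email (name TEXT REFERENCES person(name), email TEXT)")
cur.execute("CREATE TABLE phone (name TEXT REFERENCES person(name), phone TEXT)")

cur.execute("INSERT INTO person VALUES ('John Doe', '3333 - University Way', '1970-01-01')")
cur.executemany("INSERT INTO email VALUES (?, ?)",
                [("John Doe", "John@example.ca"), ("John Doe", "John@tempuri.com")])
cur.executemany("INSERT INTO phone VALUES (?, ?)",
                [("John Doe", "555-1234"), ("John Doe", "555-5555")])

# If John moves, the address is updated exactly once.
cur.execute("UPDATE person SET address = '100 Main St' WHERE name = 'John Doe'")
print(cur.execute("SELECT address FROM person WHERE name = 'John Doe'").fetchone())
# -> ('100 Main St',)
```

Notice that the name is now repeated in the email and phone tables; whether that is a good trade-off depends on whether name is a good key.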
Discussion: Is name a good primary key?
In its simplest form, a view is a query that we access as a table. Views come in two forms: standard views, which execute the query that generates them every time you access them, and materialized views, which save the output of the query for faster access and are updated only when explicitly refreshed (for example, by a command or a trigger).
Standard views: We will use standard views to produce intermediate products on demand. If you think back to last week's lab, in the last section we had to join 3 tables; an alternative is to make a view containing the first join, and then join the view to the third table.
Last week we had this code:
SELECT csdname, fire_year, SUM(population) AS "Population"
FROM blocks_pop
LEFT JOIN da_boundary ON da_boundary.dauid = SUBSTRING(blocks_pop.dbuid, 1, 8)
LEFT JOIN fires ON ST_Intersects(blocks_pop.geom, fires.geom)
WHERE fire_year >= 2015
GROUP BY csdname, fire_year
ORDER BY csdname, fire_year;
The first join connects blocks_pop with the place names in the da_boundary file; let's turn this into a view.
Right click on Views, then select Create >>> View…
Give it a name; I used pop_blocks_name.
Then, on the Code tab, enter the SQL for the JOIN:
SELECT blocks_pop.*, prname, cdname, ccsname, csdname, ername, cmaname, ctname
FROM blocks_pop
LEFT JOIN da_boundary ON da_boundary.dauid = SUBSTRING(blocks_pop.dbuid, 1, 8)
And finally, take a look at the SQL tab to see how this would look on the command line instead of the GUI, then click Save.
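For reference, the SQL tab should show something close to the following (the exact statement pgAdmin generates may differ slightly in formatting):

```sql
CREATE VIEW pop_blocks_name AS
SELECT blocks_pop.*, prname, cdname, ccsname, csdname, ername, cmaname, ctname
FROM blocks_pop
LEFT JOIN da_boundary ON da_boundary.dauid = SUBSTRING(blocks_pop.dbuid, 1, 8);
```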
Make sure the new view shows up under Views.
And rewrite your query to look like this:
SELECT csdname, fire_year, SUM(population) AS "Population"
FROM pop_blocks_name
LEFT JOIN fires ON ST_Intersects(pop_blocks_name.geom, fires.geom)
WHERE fire_year >= 2015
GROUP BY csdname, fire_year
ORDER BY csdname, fire_year;
Views work as a great shorthand for joins that we complete often, and in this case updating the source data will update the results. With that in mind, spatial operations can take a significant amount of time (around 2 seconds for this query). If we can assume that our data rarely changes, we can calculate the join when we add new data instead of every time we read data. Next, let's create a materialized view.
This time our code will be:
SELECT pop_blocks_name.*, fires.geom AS fire_geom, fire_year, fire_cause, fire_date
FROM pop_blocks_name
LEFT JOIN fires ON ST_Intersects(pop_blocks_name.geom, fires.geom)
WHERE fire_year >= 2015
Now if you right click on the view and choose “View/Edit Data” and get the top 100 rows, what do you get back?
Now refresh the view with data
Now you should have data in the table. Any time you add data you will need to refresh this view; note this could also be done with a trigger (Google this if you are interested; triggers are out of the scope of this course).
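On the command line, creating and refreshing the materialized view would look roughly like this (assuming you named it fire_pop, the name the next query uses):

```sql
CREATE MATERIALIZED VIEW fire_pop AS
SELECT pop_blocks_name.*, fires.geom AS fire_geom, fire_year, fire_cause, fire_date
FROM pop_blocks_name
LEFT JOIN fires ON ST_Intersects(pop_blocks_name.geom, fires.geom)
WHERE fire_year >= 2015;

-- Re-run this whenever the source data changes:
REFRESH MATERIALIZED VIEW fire_pop;
```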
Finally rewrite your code to look like this:
SELECT csdname, fire_year, SUM(population) AS "Population"
FROM fire_pop
GROUP BY csdname, fire_year
ORDER BY csdname, fire_year;
Finally, let's take a minute to think about why we would do this. Do you remember that in last week's lab we used WHERE fire_year >= 2015 as a way of subsetting our data to keep the runtime reasonable? With the materialized view we can spend the time to calculate all the ST_Intersects once, and then compare years to our heart's content.
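For example, if we had built the materialized view without the fire_year filter (paying the full ST_Intersects cost once, at build time), we could then filter on any year range cheaply. This is a sketch of that idea; the year 2000 below is an arbitrary example:

```sql
SELECT csdname, fire_year, SUM(population) AS "Population"
FROM fire_pop
WHERE fire_year >= 2000
GROUP BY csdname, fire_year
ORDER BY csdname, fire_year;
```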
Now let's take a look at the magical land of indexes. Indexes allow us to work faster by adding structure to the data in our tables. Under normal circumstances, every time data is added to the database it is simply appended to the end. Imagine for a minute that the phone book were sorted by the date people got phone service instead of by last name: how long might it take to find someone's number? The phone book is indexed linearly; databases index with trees, but the same principle applies.
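The phone-book analogy can be sketched in plain Python. This is illustrative only (real database indexes are trees on disk, not Python lists), and all the names here are made up:

```python
import bisect
import random

# Illustrative only: real database indexes are B-trees on disk, but the
# speed-up from ordering is the same idea as the phone book.
random.seed(0)
names = [f"person{i:05d}" for i in range(10_000)]
random.shuffle(names)
target = "person07777"

# No index: like a phone book sorted by signup date, we check entry by entry.
linear_steps = 0
for name in names:
    linear_steps += 1
    if name == target:
        break

# "Indexed": keep a sorted copy and binary-search it; each comparison
# halves the remaining search space.
ordered = sorted(names)
pos = bisect.bisect_left(ordered, target)

print("linear scan checked", linear_steps, "entries")
print("binary search needs about", len(ordered).bit_length(), "comparisons")
```

In PostgreSQL, the equivalent is to create an index on the columns we join on, as below.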
CREATE INDEX da_boundary_idx ON da_boundary (dauid);
CREATE INDEX blocks_pop_idx ON blocks_pop (dbuid);
Spatial attributes are not as easy to index: we cannot simply put polygons in order, so instead we calculate bounding boxes and then use an R-Tree (https://en.wikipedia.org/wiki/R-tree) for the index. The exact details of this are out of the scope of this course; we will just say we get bounding boxes and sort them to make things easier to find, like so:
CREATE INDEX blocks_pop_geom_idx ON blocks_pop USING GIST (geom);
CREATE INDEX fires_geom_idx ON fires USING GIST (geom);
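If you are curious whether PostgreSQL actually uses an index, EXPLAIN shows the query plan; look for an Index Scan instead of a Seq Scan. The dauid value below is just a placeholder:

```sql
EXPLAIN
SELECT * FROM da_boundary
WHERE dauid = '59090101';  -- placeholder id, substitute one from your data
```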
Back to Python!
First let's get our git repository ready. In your git folder make a new file and name it ".gitignore". Its contents should be something like the following (use the actual name of your virtual environment folder; the patterns here are one reasonable choice):

venv/
*_local.py

This file will tell git not to send our virtual environment folder, or any file whose name ends in _local.py, to the git server. Once created, we will git add, git commit, and git push this file.
Next on the list is to save our database configuration. Make a folder for lab5 inside of your git folder, and inside it make a file called config_example.py. The contents of this file are:
db_connect_string = "dbname='qgis' user='postgres' host='localhost' password='postgres'"
Once this is done, save a copy as config_local.py, and edit the dbname and password to match what you have been using in pgAdmin.
Notice how config_local.py is grayed out in Visual Studio Code? This is because it is ignored by git: we don't want to sync our password to the git server.
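If you want to confirm which rule is ignoring a file, git can tell you with check-ignore. This sketch builds a throwaway repository just to demonstrate; the *_local.py pattern is an assumption, so use whatever your own .gitignore contains:

```shell
# Build a throwaway repo just to demonstrate git check-ignore.
tmp=$(mktemp -d)
cd "$tmp"
git init --quiet
printf '%s\n' 'venv/' '*_local.py' > .gitignore
touch config_local.py
# Prints the .gitignore line that matches, or nothing if the file is not ignored.
git check-ignore -v config_local.py
```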
Next we are going to install a database connector. Open a terminal and run the command:
pip3 install psycopg2-binary
Now make your lab5.py file and we will do some imports
import config_local
import psycopg2
Run this file with the command python3 lab5.py (or python lab5.py, depending on your setup).
You should get no errors, but these are computers, so we will pause here while we get any bugs sorted out.
Next let's add the Python code to connect to our database; add this below the imports:
try:
    conn = psycopg2.connect(config_local.db_connect_string)
    cursor = conn.cursor()
except psycopg2.OperationalError:
    print("I am unable to connect to the database")
And run it again; here we are also hoping for no output.
Finally, let's query the fires table and print each row:

cursor.execute("SELECT * FROM fires")
fires = cursor.fetchall()

# For each polygon
for fire in fires:
    print(fire)