Today we will be leveraging the cloud. The short version is that the cloud is any service providing IT resources that are accessed through the internet rather than hosted on the campus we are currently at.

The Cloud
https://xkcd.com/908/

For remote sensing this is particularly advantageous, as we can access data without being restricted by our own internet speeds, and we can access large amounts of computing resources. In today's lab we will be using the Microsoft Planetary Computer SDK (Software Development Kit) to access imagery through the provided Application Programming Interface (API). Additionally, we will learn the basics of a Dask cluster to scale our work across many servers. At the time of writing, the Microsoft Planetary Computer allows a maximum of 400 cores and 3.2 TB of RAM in a single Dask cluster; for comparison, the entire GIS Lab at UNBC operates on 384 cores and 2 TB of RAM.

If you want to see an example of the type of infrastructure we will actually be using today, take a look here:

Log into the MPC

Before signing in we can see the datasets available here: https://planetarycomputer.microsoft.com/catalog. Note that these datasets may differ from the datasets available in Google Earth Engine (https://developers.google.com/earth-engine/datasets/), so data availability may influence your choice of platform.

Go to https://planetarycomputer.microsoft.com/ and select the HUB from the top menu.

Select the CPU – Python environment (And now we wait…)

As a point of interest, the GPU environments (with either PyTorch or TensorFlow) may offer substantial performance increases for machine learning if properly implemented.
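If you ever do pick one of the GPU environments (not needed for this lab), a quick check like the one below confirms a GPU is actually visible; this assumes the PyTorch environment, where torch is preinstalled.

import torch  # only available in the GPU - PyTorch environment

print(torch.cuda.is_available())      # True if a GPU is attached to the server
print(torch.cuda.get_device_name(0))  # model of the attached GPU, if one exists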

Start a new Notebook

We will be working through some examples based upon the Planetary Computer Quickstart Guides https://planetarycomputer.microsoft.com/docs

At this stage we should all be familiar with Jupyter Notebooks, and as always we will start by importing the libraries used for today's lab.

import numpy as np
import xarray as xr

import rasterio.features
import stackstac
import pystac_client
import planetary_computer

import pandas as pd
import xrspatial.multispectral as ms
import time
import datetime
import itertools

from dask_gateway import GatewayCluster

Dask

Dask is a task scheduler for parallelizing Python code, which is to say that it splits problems into smaller pieces and distributes those pieces to multiple workers. Workers could be cores in your CPU, or, in the case of clusters, separate virtual machines.
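As a minimal local sketch of the idea (no cluster needed), a Dask array looks much like a NumPy array but is split into chunks, and nothing is computed until we ask for a result:

import dask.array as da

big = da.random.random((10_000, 10_000), chunks=(2_000, 2_000))  # split into 25 chunks
total = big.mean()       # builds a task graph; no work has happened yet
print(total.compute())   # workers each reduce their chunks, results are combined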

Let's build our first cluster

cluster = GatewayCluster()  # Creates the Dask Scheduler. Might take a minute.

client = cluster.get_client()

cluster.adapt(minimum=4, maximum=24)
print(cluster.dashboard_link)

This will create a cluster, get the client that allows our processing to happen in the cluster, and then set the number of workers to adaptive, with a minimum of 4 nodes and a maximum of 24. This is advantageous because we do not wish to wait for workers to start every time a command is run, so 4 are always kept running; at the same time, there is no need for 24 nodes to sit idle consuming resources other users could be utilizing. Using adapt allows short jobs to complete on the pre-existing nodes, while larger jobs spin up workers as needed; for large jobs this spin-up represents relatively low overhead.

Finally, the last line prints the URL to your cluster's dashboard. Let's click on that and take a look around.
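If you would rather have a fixed pool of workers instead of adaptive scaling, the gateway cluster also supports scale(); this is simply an alternative to the adapt() call above.

# Alternative: request a fixed pool of 8 workers rather than adaptive scaling
# cluster.scale(8)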

Accessing Data

Before we can open data, we must first define an Area of Interest. There is a great tool from Keene State College that formats the coordinates nicely for use in Python: https://www.keene.edu/campus/maps/tool/

Choose an area of interest and replace the coordinates in the code below (note that the first and last points are the same, to close the polygon).

area_of_interest = {
    "type": "Polygon",
    "coordinates": [
        [
          [-122.8226566, 53.9006842],
          [-122.826519, 53.8837441],
          [-122.8019714, 53.8829854],
          [-122.8004265, 53.9007348],
          [-122.8226566, 53.9006842],
        ]
    ],
}
bbox = rasterio.features.bounds(area_of_interest)

Next we will connect to a STAC, or SpatioTemporal Asset Catalog, and search it for imagery.

#Create stac object from MPC API
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

#Define Search Terms
search = stac.search(
    bbox=bbox,
    datetime="2016-01-01/2020-12-31",
    collections=["sentinel-2-l2a"],
    limit=500,  # fetch items in batches of 500
    query={"eo:cloud_cover": {"lt": 25}},
)

#Execute Search
items = list(search.get_items())

#Print How Many Items were found
print(len(items))
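Each item returned by the search is a STAC Item describing one Sentinel-2 scene, and it is worth inspecting one to see the metadata we filtered on; the property and asset names printed here come from the sentinel-2-l2a collection's metadata.

# Inspect the first item returned by the search
first = items[0]
print(first.id)                            # scene identifier
print(first.datetime)                      # acquisition date and time
print(first.properties["eo:cloud_cover"])  # the cloud cover value we filtered on
print(list(first.assets.keys()))           # available assets (bands, metadata, ...)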

Now we have one more step: our search has returned the list of images available, however to actually download the data we need to sign the URLs with a personal access token. If you are not familiar with an access token, think of it as a password that software can use to work on behalf of your account. A token is far superior to a username and password as there are restrictions on what it can access, and the token will expire, making it much safer to use in plain text, though it should still be guarded.

signed_items = [planetary_computer.sign(item).to_dict() for item in items]
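To see what signing actually does, look at one asset URL after signing; the href should now end with a SAS token query string (the "B04" asset key is the red band we request below).

# The signed href is the original blob URL with a time-limited token appended
print(signed_items[0]["assets"]["B04"]["href"])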

Finally build the stack

data = (
    stackstac.stack(
        signed_items,
        assets=["B04", "B03", "B02", "B08"],  # red, green, blue, NIR
        chunksize=4096,
        resolution=100,
    )
    .where(lambda x: x > 0, other=np.nan)  # sentinel-2 uses 0 as nodata
    .assign_coords(band=lambda x: x.common_name.rename("band"))  # use common names
)
data

As we downsampled to 100 m pixels, the data is small enough to keep in memory for faster analysis.

data = data.persist()
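You can sanity-check that the stack really is small enough before (or after) persisting; nbytes reports the in-memory size of the full array.

print(f"{data.nbytes / 1e9:.2f} GB")  # approximate size of the full stack in memory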

Analyzing Data

One simple analysis we can do is creating a cloud-free image. This is done by taking the median pixel value over time; so long as clouds cover a given pixel in fewer than half of the images, this should produce a cloud-free result.

median = data.median(dim="time").compute() # Median of time not location
image = ms.true_color(r=median[0], g=median[1] ,b=median[2])

Examine the result

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))

ax.set_axis_off()
image.plot.imshow(ax=ax);

This could also be computed over given time intervals, for example months (after grouping, January is at index 0 of the resulting images and December at index 11).

#Group images by month, then compute median
monthly = data.groupby("time.month").median().compute()

#Create array of True Colour images
images = [ms.true_color(r=x[0], g=x[1] ,b=x[2]) for x in monthly]
images = xr.concat(images, dim="time")

g = images.plot.imshow(x="x", y="y", rgb="band", col="time", col_wrap=3, figsize=(6, 8))
for ax in g.axes.flat:
    ax.set_axis_off()

plt.tight_layout()

However, there is a problem here: each monthly composite combines images from multiple years. An alternative approach would be to resample the data.

monthly = data.resample(time="1M").median().compute()

However, you will notice a different issue. Any ideas what the issue is?

NDVI

We can also compute the Normalized Difference Vegetation Index (NDVI) from the monthly composites, using the red and NIR bands.

#RED = x[0, :, :]
#NIR = x[3, :, :]
#ndvi = (NIR - RED)/(NIR + RED) = (x[3, :, :] - x[0, :, :])/(x[3, :, :] + x[0, :, :])

images = [(x[3, :, :] - x[0, :, :])/(x[3, :, :] + x[0, :, :]) for x in monthly]
images = xr.concat(images, dim="time")

g = images.plot.imshow(x="x", y="y", col="time", col_wrap=3, figsize=(8, 8))
for ax in g.axes.flat:
    ax.set_axis_off()
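Because we assigned the band coordinate to the common names earlier, an equivalent and arguably safer way to write this is to select bands by name rather than by position (assuming B08 is reported with the common name "nir" in the catalog metadata; check data.band to confirm).

# Same NDVI, selecting bands by name instead of by index
nir = monthly.sel(band="nir")
red = monthly.sel(band="red")
ndvi = (nir - red) / (nir + red)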

Additional Data Sets (MODIS)

Some data is also available on the Planetary Computer, but not through the MPC API; for these datasets we will download files directly from Azure blob storage using wget. Open a terminal and use pip to install wget.

pip install wget

And load some extra libraries we will need

import tempfile
import wget
import numpy as np
import matplotlib.pyplot as plt
import os

from azure.storage.blob import ContainerClient

Connect to the data storage

modis_account_name = 'modissa'
modis_container_name = 'modis-006'
modis_account_url = 'https://' + modis_account_name + '.blob.core.windows.net/'
modis_blob_root = modis_account_url + modis_container_name + '/'

Download a file provided by NASA that lists the lat/lon extent of each MODIS tile, and add a function to determine the tile indices for a given location.

# This file is provided by NASA; it indicates the lat/lon extents of each
# MODIS tile.
#
# The file originally comes from:
#
# https://modis-land.gsfc.nasa.gov/pdf/sn_bound_10deg.txt
modis_tile_extents_url = modis_blob_root + 'sn_bound_10deg.txt'

temp_dir = os.path.join(tempfile.gettempdir(),'modis')
os.makedirs(temp_dir,exist_ok=True)
fn = os.path.join(temp_dir,modis_tile_extents_url.split('/')[-1])
wget.download(modis_tile_extents_url, fn)

# Load this file into a table, where each row is (v,h,lonmin,lonmax,latmin,latmax)
modis_tile_extents = np.genfromtxt(fn,
                     skip_header = 7, 
                     skip_footer = 3)

modis_container_client = ContainerClient(account_url=modis_account_url, 
                                         container_name=modis_container_name,
                                         credential=None)

def lat_lon_to_modis_tile(lat,lon):
    """
    Get the modis tile indices (h,v) for a given lat/lon
    
    https://www.earthdatascience.org/tutorials/convert-modis-tile-to-lat-lon/
    """
    
    found_matching_tile = False
    i = 0
    while(not found_matching_tile):
        found_matching_tile = lat >= modis_tile_extents[i, 4] \
        and lat <= modis_tile_extents[i, 5] \
        and lon >= modis_tile_extents[i, 2] and lon <= modis_tile_extents[i, 3]
        i += 1
        
    v = int(modis_tile_extents[i-1, 0])
    h = int(modis_tile_extents[i-1, 1])
    
    return h,v
  
def list_blobs_in_folder(container_name,folder_name):
    """
    List all blobs in a virtual folder in an Azure blob container
    """
    
    files = []
    generator = modis_container_client.list_blobs(name_starts_with=folder_name)
    for blob in generator:
        files.append(blob.name)
    return files
  
     
def list_hdf_blobs_in_folder(container_name,folder_name):
    """"
    List .hdf files in a folder
    """
    
    files = list_blobs_in_folder(container_name,folder_name)
    files = [fn for fn in files if fn.endswith('.hdf')]
    return files             

We can now search for MODIS tiles. Note that the date is not in a typical format: it uses the year followed by the day of year (so 2019135 is day 135 of 2019).
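If you want to compute that day number from a calendar date rather than looking it up, a small helper like the one below will do it (this helper is just for convenience and is not part of the MODIS tooling).

def to_modis_daynum(year, month, day):
    """Convert a calendar date to the MODIS YYYYDDD day number (e.g. 2019135)."""
    doy = datetime.date(year, month, day).timetuple().tm_yday  # day of year
    return year * 1000 + doy

print(to_modis_daynum(2019, 5, 15))  # 2019135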

product = 'MCD43A4'

# Let's look at the tile containing Chicago, IL, on May 15, 2019 (day of year 135)
h,v = lat_lon_to_modis_tile(41.881832,-87.623177)
daynum_start= 2019135
daynum_end = 2019160

file_list = [] # Empty list to store results
for daynum in range(daynum_start, daynum_end + 1, 1):
    folder = product + '/' + '{:0>2d}/{:0>2d}'.format(h,v) + '/' + str(daynum)
    date = pd.to_datetime(datetime.date(int(str(daynum)[:4]), 1, 1) + datetime.timedelta(int(str(daynum)[4:]) -1 ))
    # Find all HDF files from this tile on this day
    filenames = list_hdf_blobs_in_folder(modis_container_name,folder)
    #print('Found {} matching file(s):'.format(len(filenames)))
    #for fn in filenames:
    #    print(fn)

    # Work with the first returned URL
    blob_name = filenames[0]

    # Download to a temporary file
    url = modis_blob_root + blob_name

    filename = os.path.join(temp_dir,blob_name.replace('/','_'))
    if not os.path.isfile(filename):
        wget.download(url,filename)
    
    file_list.append([filename, date]) # Format is file_list[0] <--> filename, file_list[1] <--> timestamp

Open the downloaded files as rioxarray arrays

import rioxarray as rxr

image_list = []
for file in file_list: # Format is file_list[0] <--> filename, file_list[1] <--> timestamp
    ds = rxr.open_rasterio(file[0]) # ds is rioxarray of image at 'filename'
    image_list.append([ds, file[1]]) # Format is image_list[0] <--> dataset, image_list[1] <--> timestamp
results = [] # Empty list to store results

# For each image: 1) Load Bands, 2) Perform Calculation, 3) Append results, Optionally print for debugging
for image in image_list: # Format is image_list[0] <--> dataset, image_list[1] <--> timestamp
    # Open Bands
    band1 = image[0]['Nadir_Reflectance_Band1']
    # ...
    
    # Perform Calculations
    tmp = np.nanmean(band1) # Average reflectance of pixels in band 1
    
    # Append Results
    results.append([tmp, image[1]])  # Format is results[0] <--> result, results[1] <--> timestamp
    
    # Print
    print([tmp, image[1]])

Unpacking Lists

When representing tabular data in lists, the data will be stored either as (TYPE A) a list of rows, where each element of a row is a column value, or (TYPE B) a list of columns, where each element of a column is a row value.

Given the table below, which we will call T

col_1 | col_2
a1    | b1
a2    | b2

TYPE A stores T as [[a1, b1], [a2, b2]] (a list of rows), while TYPE B stores it as [[a1, a2], [b1, b2]] (a list of columns).

When the code above is run, we are appending rows of results, so we end up with data formatted as TYPE A. This is an issue when we want to plot the data, as the plot command needs two separate vectors in the form plot(x, y). However, once the problem is recognised it is easy to solve using the zip function to transpose the list.

TYPE_B = list(zip(*TYPE_A))
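For example, applied to the results list we built above (rows of [value, timestamp]), transposing with zip gives the two vectors matplotlib expects; the variable names here are just for illustration.

# Transpose the list of [value, timestamp] rows into two column vectors
values, timestamps = zip(*results)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(timestamps, values)
ax.set_xlabel("Date")
ax.set_ylabel("Mean Band 1 reflectance")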

Finishing up

Before leaving, shut down your Dask cluster

cluster.close()

And log out (in the File menu)

Assignment

  1. Compute the NDSI, grouped by month, from Sentinel-2 imagery from 2016 to 2021.
  2. Compute the NDVI per day in 2021 (only April & May needed) from MODIS data.
    Use matplotlib to graph the average NDVI of the tile.
  3. How is Dask different from standard multicore processing?
  4. What is a STAC and how does it differ from the .SAFE file format Sentinel images are usually downloaded as?