A toy statistical model of ice thickness
Contents
A toy statistical model of ice thickness#
This notebook primarily introduces the gis.gridded_attributes
and gis.gridded_mb_attributes
tasks which are simply adding certain diagnostic variables to the glacier maps such as slope, aspect, distance from border, and other mass-balance related variables which might be useful to some people.
We also show how to use these data to build a very simple (and not very good) regression model of ice thickness. This second part (“Build a statistical model”) is rather technical and not directly related to OGGM, so you might stop at this point unless your are into the topic yourself. You will need to install scikit-learn for the second part of this notebook.
Set-up#
We are going to use the South Glacier example taken from the ITMIX experiment. We’ll also use the data provided with it.
## Libs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import salem
# OGGM
import oggm.cfg as cfg
from oggm import utils, workflow, tasks, graphics
# Initialize OGGM and set up the default run parameters
cfg.initialize(logging_level='WARNING')
cfg.PARAMS['use_multiprocessing'] = False
# Local working directory (where OGGM will write its output)
cfg.PATHS['working_dir'] = utils.gettempdir('OGGM_Toy_Thickness_Model')
# We start at level 3, because we need all data for the attributes
gdirs = workflow.init_glacier_directories(['RGI60-01.16195'], from_prepro_level=3, prepro_border=10)
2023-03-07 12:46:49: oggm.cfg: Reading default parameters from the OGGM `params.cfg` configuration file.
2023-03-07 12:46:49: oggm.cfg: Multiprocessing switched OFF according to the parameter file.
2023-03-07 12:46:49: oggm.cfg: Multiprocessing: using all available processors (N=2)
2023-03-07 12:46:49: oggm.workflow: init_glacier_directories from prepro level 3 on 1 glaciers.
2023-03-07 12:46:49: oggm.workflow: Execute entity tasks [gdir_from_prepro] on 1 glaciers
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 7
5 cfg.PATHS['working_dir'] = utils.gettempdir('OGGM_Toy_Thickness_Model')
6 # We start at level 3, because we need all data for the attributes
----> 7 gdirs = workflow.init_glacier_directories(['RGI60-01.16195'], from_prepro_level=3, prepro_border=10)
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/workflow.py:533, in init_glacier_directories(rgidf, reset, force, from_prepro_level, prepro_border, prepro_rgi_version, prepro_base_url, from_tar, delete_tar)
531 if cfg.PARAMS['dl_verify']:
532 utils.get_dl_verify_data('cluster.klima.uni-bremen.de')
--> 533 gdirs = execute_entity_task(gdir_from_prepro, entities,
534 from_prepro_level=from_prepro_level,
535 prepro_border=prepro_border,
536 prepro_rgi_version=prepro_rgi_version,
537 base_url=prepro_base_url)
538 else:
539 # We can set the intersects file automatically here
540 if (cfg.PARAMS['use_intersects'] and
541 len(cfg.PARAMS['intersects_gdf']) == 0 and
542 not from_tar):
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/workflow.py:191, in execute_entity_task(task, gdirs, **kwargs)
187 if ng > 3:
188 log.workflow('WARNING: you are trying to run an entity task on '
189 '%d glaciers with multiprocessing turned off. OGGM '
190 'will run faster with multiprocessing turned on.', ng)
--> 191 out = [pc(gdir) for gdir in gdirs]
193 return out
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/workflow.py:191, in <listcomp>(.0)
187 if ng > 3:
188 log.workflow('WARNING: you are trying to run an entity task on '
189 '%d glaciers with multiprocessing turned off. OGGM '
190 'will run faster with multiprocessing turned on.', ng)
--> 191 out = [pc(gdir) for gdir in gdirs]
193 return out
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/workflow.py:108, in _pickle_copier.__call__(self, arg)
106 for func in self.call_func:
107 func, kwargs = func
--> 108 res = self._call_internal(func, arg, kwargs)
109 return res
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/workflow.py:102, in _pickle_copier._call_internal(self, call_func, gdir, kwargs)
99 gdir, gdir_kwargs = gdir
100 kwargs.update(gdir_kwargs)
--> 102 return call_func(gdir, **kwargs)
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/workflow.py:251, in gdir_from_prepro(entity, from_prepro_level, prepro_border, prepro_rgi_version, base_url)
248 tar_base = utils.get_prepro_gdir(prepro_rgi_version, rid, prepro_border,
249 from_prepro_level, base_url=base_url)
250 from_tar = os.path.join(tar_base.replace('.tar', ''), rid + '.tar.gz')
--> 251 return oggm.GlacierDirectory(entity, from_tar=from_tar)
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/utils/_workflow.py:2529, in GlacierDirectory.__init__(self, rgi_entity, base_dir, reset, from_tar, delete_tar)
2525 raise RuntimeError('RGI Version not supported: '
2526 '{}'.format(self.rgi_version))
2528 # remove spurious characters and trailing blanks
-> 2529 self.name = filter_rgi_name(name)
2531 # region
2532 reg_names, subreg_names = parse_rgi_meta(version=self.rgi_version[0])
File /usr/local/pyenv/versions/3.10.10/lib/python3.10/site-packages/oggm/utils/_funcs.py:734, in filter_rgi_name(name)
728 def filter_rgi_name(name):
729 """Remove spurious characters and trailing blanks from RGI glacier name.
730
731 This seems to be unnecessary with RGI V6
732 """
--> 734 if name is None or len(name) == 0:
735 return ''
737 if name[-1] in ['À', 'È', 'è', '\x9c', '3', 'Ð', '°', '¾',
738 '\r', '\x93', '¤', '0', '`', '/', 'C', '@',
739 'Å', '\x06', '\x10', '^', 'å', ';']:
TypeError: object of type 'numpy.float64' has no len()
These are the tasks adding the attributes to the gridded file:
# Tested tasks
task_list = [
tasks.gridded_attributes,
tasks.gridded_mb_attributes,
]
for task in task_list:
workflow.execute_entity_task(task, gdirs)
# Pick our glacier
gdir = gdirs[0]
The gridded attributes#
Let’s open the gridded data file with xarray:
ds = xr.open_dataset(gdir.get_filepath('gridded_data'))
# List all variables
ds
The file contains several variables with their description. For example:
ds.oggm_mb_above_z
Let’s plot a few of them (we show how to plot them with xarray and with oggm:
ds.slope.plot();
plt.axis('equal');
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
graphics.plot_raster(gdir, var_name='aspect', cmap='twilight', ax=ax1)
graphics.plot_raster(gdir, var_name='oggm_mb_above_z', ax=ax2)
Retrieve these attributes at point locations#
For this glacier, we have ice thickness observations (all downloaded from the supplementary material of the ITMIX paper). Let’s make a table out of it:
df = salem.read_shapefile(utils.get_demo_file('IceThick_SouthGlacier.shp'))
coords = np.array([p.xy for p in df.geometry]).squeeze()
df['lon'] = coords[:, 0]
df['lat'] = coords[:, 1]
df = df[['lon', 'lat', 'thick']]
Convert the longitudes and latitudes to the glacier map projection:
xx, yy = salem.transform_proj(salem.wgs84, gdir.grid.proj, df['lon'].values, df['lat'].values)
df['x'] = xx
df['y'] = yy
And plot these data:
geom = gdir.read_shapefile('outlines')
f, ax = plt.subplots()
df.plot.scatter(x='x', y='y', c='thick', cmap='viridis', s=10, ax=ax);
geom.plot(ax=ax, facecolor='none', edgecolor='k');
Method 1: interpolation#
The measurement points of this dataset are very frequent and close to each other. There are plenty of them:
len(df)
Here, we will keep them all and interpolate the variables of interest at the point’s location. We use xarray for this:
vns = ['topo',
'slope',
'slope_factor',
'aspect',
'dis_from_border',
'catchment_area',
'lin_mb_above_z',
'lin_mb_above_z_on_catch',
'oggm_mb_above_z',
'oggm_mb_above_z_on_catch',
]
# Interpolate (bilinear)
for vn in vns:
df[vn] = ds[vn].interp(x=('z', df.x), y=('z', df.y))
Let’s see how these variables can explain the measured ice thickness:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))
df.plot.scatter(x='dis_from_border', y='thick', ax=ax1);
df.plot.scatter(x='slope', y='thick', ax=ax2);
df.plot.scatter(x='oggm_mb_above_z', y='thick', ax=ax3);
There is a negative correlation with slope (as expected), a positive correlation with the mass-flux (oggm_mb_above_z), and a slight positive correlation with the distance from the glacier boundaries.
Method 2: aggregated per grid point#
There are so many points that much of the information obtained by OGGM is interpolated. A way to deal with this is to aggregate all the measurement points per grid point and to average them. Let’s do this:
df_agg = df[['lon', 'lat', 'thick']].copy()
ii, jj = gdir.grid.transform(df['lon'], df['lat'], crs=salem.wgs84, nearest=True)
df_agg['i'] = ii
df_agg['j'] = jj
# We trick by creating an index of similar i's and j's
df_agg['ij'] = ['{:04d}_{:04d}'.format(i, j) for i, j in zip(ii, jj)]
df_agg = df_agg.groupby('ij').mean()
# Conversion does not preserve ints
df_agg['i'] = df_agg['i'].astype(int)
df_agg['j'] = df_agg['j'].astype(int)
len(df_agg)
# Select
for vn in vns:
df_agg[vn] = ds[vn].isel(x=('z', df_agg['i']), y=('z', df_agg['j']))
We now have 9 times less points, but the main features of the data remain unchanged:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))
df_agg.plot.scatter(x='dis_from_border', y='thick', ax=ax1);
df_agg.plot.scatter(x='slope', y='thick', ax=ax2);
df_agg.plot.scatter(x='oggm_mb_above_z', y='thick', ax=ax3);
Build a statistical model#
Let’s use scikit-learn to build a very simple linear model of ice-thickness. First, we have to acknowledge that there is a correlation between many of the explanatory variables we will use:
import seaborn as sns
plt.figure(figsize=(10, 8));
sns.heatmap(df.corr(), cmap='RdBu');
This is a problem for linear regression models, which cannot deal well with correlated explanatory variables. We have to do a so-called “feature selection”, i.e. keep only the variables which are independently important to explain the outcome.
For the sake of simplicity, let’s use the Lasso method to help us out:
import warnings
warnings.filterwarnings("ignore") # sklearn sends a lot of warnings
# Scikit learn (can be installed e.g. via conda install -c anaconda scikit-learn)
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
# Prepare our data
df = df.dropna()
# Variable to model
target = df['thick']
# Predictors - remove x and y (redundant with lon lat)
# Normalize it in order to be able to compare the factors
data = df.drop(['thick', 'x', 'y'], axis=1).copy()
data_mean = data.mean()
data_std = data.std()
data = (data - data_mean) / data_std
# Use cross-validation to select the regularization parameter
lasso_cv = LassoCV(cv=5, random_state=0)
lasso_cv.fit(data.values, target.values)
We now have a statistical model trained on the full dataset. Let’s see how it compares to the calibration data:
odf = df.copy()
odf['thick_predicted'] = lasso_cv.predict(data.values)
f, ax = plt.subplots(figsize=(6, 6));
odf.plot.scatter(x='thick', y='thick_predicted', ax=ax);
plt.xlim([-25, 220]);
plt.ylim([-25, 220]);
Not so well. It is doing OK for low thicknesses, but high thickness are strongly underestimated. Which explanatory variables did the model choose as being the most important?
predictors = pd.Series(lasso_cv.coef_, index=data.columns)
predictors
This is interesting. Lons and lats have a predictive power over thickness? Unfortunately, this is more a coincidence than a reality. If we look at the correlation of the error with the variables of importance, one sees that there is a large correlation between the error and the spatial variables:
odf['error'] = np.abs(odf['thick_predicted'] - odf['thick'])
odf.corr()['error'].loc[predictors.loc[predictors != 0].index]
Furthermore, the model is not very robust. Let’s use cross-validation to check our model parameters:
k_fold = KFold(4, random_state=0, shuffle=True)
for k, (train, test) in enumerate(k_fold.split(data.values, target.values)):
lasso_cv.fit(data.values[train], target.values[train])
print("[fold {0}] alpha: {1:.5f}, score (r^2): {2:.5f}".
format(k, lasso_cv.alpha_, lasso_cv.score(data.values[test], target.values[test])))
The fact that the hyper-parameter alpha and the score change that much between iterations is a sign that the model isn’t very robust.
Our model is bad, but… let’s apply it#
In order to apply the model to our entre glacier, we have to get the explanatory variables from the gridded dataset again:
# Add variables we are missing
lon, lat = gdir.grid.ll_coordinates
ds['lon'] = (('y', 'x'), lon)
ds['lat'] = (('y', 'x'), lat)
# Generate our dataset
pred_data = pd.DataFrame()
for vn in data.columns:
pred_data[vn] = ds[vn].data[ds.glacier_mask == 1]
# Normalize using the same normalization constants
pred_data = (pred_data - data_mean) / data_std
# Apply the model
pred_data['thick'] = lasso_cv.predict(pred_data.values)
# For the sake of physics, clip negative thickness values...
pred_data['thick'] = np.clip(pred_data['thick'], 0, None)
# Back to 2d and in xarray
var = ds[vn].data * np.NaN
var[ds.glacier_mask == 1] = pred_data['thick']
ds['linear_model_thick'] = (('y', 'x'), var)
ds['linear_model_thick'].attrs['description'] = 'Predicted thickness'
ds['linear_model_thick'].attrs['units'] = 'm'
ds['linear_model_thick'].plot();
Take home points#
we have shown how to compute gridded attributes from OGGM glaciers such as slope, aspect, or catchments
we used two methods to extract these data at point locations: with interpolation or with aggregated averages on each grid point
as an application example, we trained a linear regression model to predict the ice thickness of this glacier at unseen locations
The model we developed was quite bad and we used quite lousy statistics. If I had more time to make it better, I would:
make a pre-selection of meaningful predictors to avoid discontinuities
use a non-linear model
use cross-validation to better asses the true skill of the model
…
What’s next?#
return to the OGGM documentation
back to the table of contents
Bonus: how does the statistical model compare to OGGM “out-of the box”?#
# Close the dataset cause we are going to write in it
ds = ds.load()
ds.close()
ds.to_netcdf(gdir.get_filepath('gridded_data'))
# Distribute thickness using default values only
workflow.execute_entity_task(tasks.distribute_thickness_per_altitude, gdirs);
# Add the linear model data for comparison
ds = xr.open_dataset(gdir.get_filepath('gridded_data'))
df_agg['oggm_thick'] = ds.distributed_thickness.isel(x=('z', df_agg['i']), y=('z', df_agg['j']))
df_agg['linear_model_thick'] = ds.linear_model_thick.isel(x=('z', df_agg['i']), y=('z', df_agg['j']))
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5));
ds['linear_model_thick'].plot(ax=ax1);
ds['distributed_thickness'].plot(ax=ax2);
plt.tight_layout();
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6));
df_agg.plot.scatter(x='thick', y='linear_model_thick', ax=ax1);
ax1.set_xlim([-25, 220]);
ax1.set_ylim([-25, 220]);
df_agg.plot.scatter(x='thick', y='oggm_thick', ax=ax2);
ax2.set_xlim([-25, 220]);
ax2.set_ylim([-25, 220]);