Error analysis of the global pre-processing workflow

Error analysis of the global pre-processing workflow#

Here we reproduce the error analysis shown in Maussion et al. (2019) for the pre-processing part only, and for the glacier directories (version 1.6 and 1.4). The error analysis of user runs needs a separate handling, see the deal_with_errors notebook for more information.

Get the files#

We download the glacier_statistics files from the preprocessed directories folders, at the level 5. That is, we are going to count all errors that happened during the pre-processing chain.

from oggm import utils
import pandas as pd
import seaborn as sns

We will start with a preprocessed directory for OGGM v1.6 (W5E5 with centerlines):

# W5E5 centerlines
url = 'https://cluster.klima.uni-bremen.de/~oggm/gdirs/oggm_v1.6/L3-L5_files/2023.3/centerlines/W5E5/RGI62/b_080/L5/summary/'
# this can take some time
df = []
for rgi_reg in range(1, 19):
    fpath = utils.file_downloader(url + f'glacier_statistics_{rgi_reg:02d}.csv')
    df.append(pd.read_csv(fpath, index_col=0, low_memory=False))
df = pd.concat(df, sort=False).sort_index()

Analyze the errors#

sns.countplot(y="error_task", data=df);
../../_images/a1f79906b108c57c9e1c8d80d8816109952d0cd206acdf729650b0fb2c055940.png
"% area errors all sources: {:.2f}%".format(df.loc[~df['error_task'].isnull()].rgi_area_km2.sum() / df.rgi_area_km2.sum() * 100)
'% area errors all sources: 0.40%'
"% failing glaciers all sources: {:.2f}%".format(df.loc[~df['error_task'].isnull()].rgi_area_km2.count() / df.rgi_area_km2.count() * 100)
'% failing glaciers all sources: 0.53%'

We now look at the errors that occur already before applying the climate tasks and the historical run (i.e., before mb_calibration_from_scalar_mb and flowline_model_run_historical):

dfe = df.loc[~df['error_task'].isnull()]
dfe = dfe.loc[~dfe['error_task'].isin(['mb_calibration_from_scalar_mb','flowline_model_run_historical'])]
"% area errors before climate: {:.2f}%".format(dfe.rgi_area_km2.sum() / df.rgi_area_km2.sum() * 100)
'% area errors before climate: 0.06%'
"% failing glaciers all sources: {:.2f}%".format(dfe.loc[~dfe['error_task'].isnull()].rgi_area_km2.count() / df.rgi_area_km2.count() * 100)
'% failing glaciers all sources: 0.32%'

Although there are already many glaciers failing before the climate tasks, from a relative missing glacier area perspective, much less of the failing glacier area occurs. The reason is that the largest failing glaciers have mostly flowline_model_run_historical errors that only occur in the preprocessed directories level 4 or higher (after the climate tasks):

# 15 largest glaciers
df.loc[~df['error_task'].isnull()].sort_values(by='rgi_area_km2',
                                               ascending=False)[['rgi_area_km2', 'error_task',
                                                                 'error_msg']].iloc[:15]
rgi_area_km2 error_task error_msg
rgi_id
RGI60-13.54431 663.729 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-13.43483 282.721 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-17.01218 110.978 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-17.13796 69.313 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-10.00002 48.144 glacier_masks GeometryError: RGI60-10.00002 is a nominal gla...
RGI60-13.01430 43.094 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-14.10738 39.506 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-15.04151 36.050 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-03.04079 35.752 initialize_flowlines RuntimeError: Altitude range of main flowline ...
RGI60-13.04943 34.599 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-13.13612 30.578 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-14.08485 29.861 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-17.13799 29.768 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-02.02622 28.163 flowline_model_run_historical RuntimeError: CFL error: required time step sm...
RGI60-14.08487 27.122 flowline_model_run_historical RuntimeError: CFL error: required time step sm...

Example error comparison#

  • centerlines vs elevation bands:

# W5E5 elevation bands
url = 'https://cluster.klima.uni-bremen.de/~oggm/gdirs/oggm_v1.6/L3-L5_files/2023.3/elev_bands/W5E5/RGI62/b_080/L5/summary/'
# this can take some time
df_elev = []
# we don't look at RGI19 (Antarctic glaciers), because no climate available for CRU
for rgi_reg in range(1, 19):
    fpath = utils.file_downloader(url + f'glacier_statistics_{rgi_reg:02d}.csv')
    df_elev.append(pd.read_csv(fpath, index_col=0, low_memory=False))
df_elev = pd.concat(df_elev, sort=False).sort_index()
rel_area_elev = df_elev.loc[~df_elev['error_task'].isnull()].rgi_area_km2.sum() / df_elev.rgi_area_km2.sum() * 100
rel_area_cent = df.loc[~df['error_task'].isnull()].rgi_area_km2.sum() / df.rgi_area_km2.sum() * 100
print(f'% area errors from all sources for elevation band flowlines is {rel_area_elev:.2f}%'+'\n'
      f'compared to {rel_area_cent:.2f}% for centerlines with W5E5')                                                         
% area errors from all sources for elevation band flowlines is 0.08%
compared to 0.40% for centerlines with W5E5

much less errors occur when using elevation band flowlines than when using centerlines!

-> Reason: less glacier_mask errors!

# you can check out the different error messages with that
# but we only output the first 20 here
df_elev.error_msg.dropna().unique()[:20]
array(['ValueError: Trapezoid beds need to have origin widths > 0.',
       'ValueError: operands could not be broadcast together with shapes (4,) (5,) ',
       'AssertionError: ',
       'ValueError: operands could not be broadcast together with shapes (3,) (5,) ',
       'RuntimeError: RGI60-01.24246: we tried very hard but we could not find a combination of parameters that works for this ref mb.',
       'RuntimeError: RGI60-01.24248: we tried very hard but we could not find a combination of parameters that works for this ref mb.',
       'RuntimeError: RGI60-01.24600: we tried very hard but we could not find a combination of parameters that works for this ref mb.',
       'RuntimeError: RGI60-01.24602: we tried very hard but we could not find a combination of parameters that works for this ref mb.',
       'RuntimeError: RGI60-01.24603: we tried very hard but we could not find a combination of parameters that works for this ref mb.',
       'RuntimeError: RGI60-01.26517: we tried very hard but we could not find a combination of parameters that works for this ref mb.',
       'RuntimeError: RGI60-01.26971: we tried very hard but we could not find a combination of parameters that works for this ref mb.',
       'RuntimeError: Glacier exceeds domain boundaries, at year: 2007.4166666666667',
       'InvalidDEMError: (RGI60-02.13793) DEM altidude range too small.',
       'RuntimeError: Glacier exceeds domain boundaries, at year: 1983.75',
       'RuntimeError: Glacier exceeds domain boundaries, at year: 1984.0833333333333',
       'RuntimeError: Glacier exceeds domain boundaries, at year: 1983.25',
       'RuntimeError: Glacier exceeds domain boundaries, at year: 1997.8333333333333',
       'RuntimeError: Glacier exceeds domain boundaries, at year: 1985.25',
       'ValueError: operands could not be broadcast together with shapes (2,) (5,) ',
       'RuntimeError: Glacier exceeds domain boundaries, at year: 1973.3333333333333'],
      dtype=object)
  • Compare different climate datasets

    • this works only when using OGGM v1.4 urls (comparing CRU vs ERA5)

    • in OGGM v1.6 (state: March 2023), only GSWP3_W5E5 exists

# attention here OGGM_v1.4 is used and this is just for demonstration purposes how to compare
# different preprocessed directories!
# This is CRU + centerlines. But you can try CRU+elev_bands, or ERA5+elev_bands, etc!
url = 'https://cluster.klima.uni-bremen.de/~oggm/gdirs/oggm_v1.4/L3-L5_files/CRU/centerlines/qc3/pcp2.5/no_match/RGI62/b_040/L5/summary/'
# this can take some time
df_cru_v14 = []
# we don't look at RGI19 (Antarctic glaciers), because no climate available for CRU
for rgi_reg in range(1, 19):
    fpath = utils.file_downloader(url + f'glacier_statistics_{rgi_reg:02d}.csv')
    df_cru_v14.append(pd.read_csv(fpath, index_col=0, low_memory=False))
df_cru_v14 = pd.concat(df_cru_v14, sort=False).sort_index()
# ERA5 uses a different precipitation factor in OGGM_v1.4
url = 'https://cluster.klima.uni-bremen.de/~oggm/gdirs/oggm_v1.4/L3-L5_files/ERA5/centerlines/qc3/pcp1.6/no_match/RGI62/b_040/L5/summary/'
# this can take some time
df_era5_v14 = []
# we don't look at RGI19 (Antarctic glaciers), because no climate available for CRU
for rgi_reg in range(1, 19):
    fpath = utils.file_downloader(url + f'glacier_statistics_{rgi_reg:02d}.csv')
    df_era5_v14.append(pd.read_csv(fpath, index_col=0, low_memory=False))
df_era5_v14 = pd.concat(df_era5_v14, sort=False).sort_index()
rel_area_cent_era5_v14 = df_era5_v14.loc[~df_era5_v14['error_task'].isnull()].rgi_area_km2.sum() / df_era5_v14.rgi_area_km2.sum() * 100
rel_area_cent_cru_v14 = df_cru_v14.loc[~df_cru_v14['error_task'].isnull()].rgi_area_km2.sum() / df_cru_v14.rgi_area_km2.sum() * 100
print(f"% area errors all sources for ERA5 is {rel_area_cent_era5_v14:.2f}% compared to {rel_area_cent_cru_v14:.2f}% for CRU")
                                                                        
% area errors all sources for ERA5 is 0.92% compared to 3.24% for CRU

more than three times less errors from the climate tasks occur when using ERA5 than when using CRU !

  • Compare between OGGM versions (and climate datasets)

print('% area errors from all sources for centerlines is: \n'+
      f'{rel_area_cent_cru_v14:.2f}% for CRU and OGGM_v14 \n'+
      f'{rel_area_cent_era5_v14:.2f}% for ERA5 and OGGM_v14 \n'+
      f'{rel_area_cent:.2f}% for W5E5 and OGGM_v16')
                                                                              
% area errors from all sources for centerlines is: 
3.24% for CRU and OGGM_v14 
0.92% for ERA5 and OGGM_v14 
0.40% for W5E5 and OGGM_v16

Great, the most recent preprocessed directories create the least amount of failing glacier area

This is either a result of the different applied climate, MB calibration or other changes in OGGM_v16. This could be checked more in details by looking into which tasks fail less.

What’s next?#