Model Validation

A model is useful only in so far as that it is valid. Having proposed models for Arctic sea ice extent and volume, we must address the question of whether the models accurately describe the existing data and predict future data. ARMA time series models place strong demands on the noise model. In order for the model to be valid, the noise, as estimated by the model residuals, must be uncorrelated and have constant variance. A stronger version of this is to require that the noise be independent and identically distributed.

Testing for correlation is typically an automatic part of the model building process. Typically several possible models are tested until one is found which does not show statistically significant correlations in the residuals. In this case, the model for the NSIDC extent does show a statistically significant correlation at an interval of 11 months.

There is a possible physical interpretation for this. The monthly data is averaged over each month. This means that often changes in the ice can be observed in the daily data or in other related data sets which do not appear in the monthly data. However, the year over year changes reflect this change in the ice the following year. A possible example is that in July 2014, by the end of the month it was clear that daily extent loss was slow and so the measured extent was high. This did not appear in the July monthly extent data, but did appear in August, when the August monthly extent was higher than predicted. At the same time, this change appeared in the July 2014 volume data, which was also higher than predicted.

Looking forward one year, the July 2015 monthly extent was higher than predicted. It is possible to argue that the high July 2015 extent reflected the low July 2014 extent loss, which was also shown in the high August 2014 extent. Therefore the July 2015 year-over-year change is correlated with the August 2014 year-over-year change, at an interval of 11 months, and the model should be modified to incorporate this.

Until now, we have considered this possible correlation to be too small, and the overhead in model complexity too large, to justify incorporating this correlation into the model. We will continue to observe this correlation and will modify the model if it becomes necessary to accurately track this correlation.

In contrast, the PIOMAS volume model does not show meaningful significant correlations in the residuals.

The second property that the model requires is that the residuals have constant variance. We prefer the stronger assumption that the residuals are identically distributed, which is a necessary assumption to justify the bootstrap prediction distributions. There are more direct ways of testing for identical distribution, but because we are using these models specifically to predict future ice levels, we propose directly testing the validity of these predictions.

Under the model assumptions, the predictions generated by the model are the true mean of possible future paths taken by the data. The noise is assumed to be independent and identically distributed, with the distribution given by the empirical distribution of the residuals. Under these assumptions, we can create a bootstrap sample of possible futures for the model. For each possible future (or sample path), we can measure the distance from the sample path to the mean path, and then compute a distribution of the distances of the sample paths from the mean path.

We can then measure the distance of the observed actual future path to the predicted mean path. If the model assumptions reasonably describe the true ice behavior, then the observed path should not have a statistically unusual distance from the mean path.

Distance is measured by the sum of the squared differences between the predicted values at each time point and the mean values at each time point.

For the NSIDC extent, a mean prediction and a collection of sample paths were generated for the time period May 2014 - April 2015, and the bootstrap distribution of the distance of the sample paths to the mean prediction was computed. The distance of the observed data for this time period to the mean prediction was computed and compared to the bootstrap distribution. 35.09% of sample paths had a greater distance than the observed distance, so the model appears to reasonably describe the extent over the past year. (This is effectively a two-sided p-value of .70.)

For the PIOMAS volume, a similar computation results in the conclusion that 1.29% of sample paths have a greater distance over the prediction period from the mean than the observed data. (This is a two-sided p-value of .026.) However, this should not necessarily come as a surprise. The prediction error in July 2014 was 1.00055 thousand cubic km, which is the largest positive error in the entire data set. The question then becomes if this prediction error is a failure of the model or a signal of truly atypical behavior of the actual data.

It is tempting to write this off as bad luck. The data is an extreme value among the possible sample paths, but this was a consequence of the fact that the July data was also an extreme value. However, it should also be treated as evidence of typical behavior for the model. In particular, the assumption that the noise has constant variance (or standard deviation) should be reconsidered.

Examination of the residuals on a decade by decade basis indicates that the standard deviation of the residuals (which should equal the standard deviation of the noise) has been growing over time. For 1980 through 1989, the standard deviation was 0.210. For 1990 through 1999, this increased to 0.256. For 2000 through 2009, this was 0.258. Without doing a formal test, the standard deviation may have changed from the first decade of data to the later data, but the 20 year period from 1990 to 2009 appears to be constant.

However, the standard deviation for 2010 through April 2014 (the most recent data used to build the model) is 0.391. This is clearly not in agreement with the earlier time periods. The conclusion is that the ARMA family of models may not provide a good fit in modeling the PIOMAS data. Another possibility, which may behave better, would be to consider models in the ARCH family, which can directly model changes in the variance of the noise.

One final consideration in evaluating the validity of the models is to observe changes in the model based on refitting the model to new data. Rather than considering the new data from May 2014 to April 2015 as new data applied to the existing model, we can rebuild the model from the ground up using all of the data from 1979 through April 2015. If the model represents the data well, we should expect only small changes in the model parameters and in the forward predictions of the model.

The NSIDC model based on the data through April 2014, as stated above, is

Y(t) = -0.0540 + 0.6589 * ( Y(t - 1) + 0.0540 ) - 0.7851 * Z(t - 12) + Z(t)

The revised model based on the data through April 2015 is

Y(t) = -0.0526 + 0.6625 * ( Y(t - 1) + 0.0526 ) - 0.7912 * Z(t - 12) + Z(t)

The parameter values changed by less than 3%. The data for May 2014 to April 2015 was used to generate predictions for May 2015 to April 2016 from the April 2014 model. This was compared to the predictions for May 2015 to April 2016 from the April 2015 model, and the predictions agree to within 0.05 million sq km.

The results for the NSIDC models are more impressive in light of the fact that there was a substantial revision of the NSIDC data in March 2015. The changes were typically on the order of 0.01 million sq km and always less than 0.10 million sq km, but the fact that these changes did not have a disruptive effect on the model supports the robustness of the model.

The PIOMAS model from April 2014 is

Y(t) = -0.3216 + 1.196 * (Y(t - 1) + 0.3216) - 0.2641 * (Y(t - 2) + 0.3216) + 0.3396 * Z(t - 1) - 0.6805 * Z(t - 12) - 0.3396 * 0.6805 * Z(t - 13) + Z(t)

The PIOMAS model from April 2015 is

Y(t) = -0.2889 + 1.206 * (Y(t - 1) + 0.2889) - 0.2684 * (Y(t - 2) + 0.2889) + 0.3312 * Z(t - 1) - 0.6867 * Z(t - 12) - 0.3312 * 0.6867 * Z(t - 13) + Z(t)

The changes are small, with the exception of the mean value, which has changed from a loss of 320 cu km year over year to a loss of 290 cu km year over year. The predicted values for both models for the time period of May 2015 to April 2016 were compared and the largest difference in predicted values is 120 cu km. This occurs at the April maximum. On a percentage basis the largest differences are around 1%, near the September minimum.

Predictions

The models for NSIDC extent and PIOMAS volume were updated to reflect the data as of April 2015, and then were used to generate predictions for extent and volume for the following 12 months. The predictions are listed below.

Month | May | June | July | Aug | Sept | Oct | Nov | Dec | Jan | Feb | March | April |

Extent | 12.63 | 10.95 | 8.13 | 5.68 | 4.77 | 7.58 | 9.94 | 12.17 | 13.58 | 14.38 | 14.63 | 14.05 |

Volume | 22.99 | 18.38 | 11.76 | 7.48 | 6.22 | 7.34 | 10.48 | 14.02 | 17.48 | 20.32 | 22.34 | 23.37 |

The reported data for NSIDC extent for May, June, and July is 12.65, 10.97, and 8.77, respectively. The predictions for May and June were so close to the observed data that correcting for the observations has essentially no effect on the remaining predictions. July, on the other hand, had a substantial missed prediction. This was used to correct the predictions for August and September, resulting in updated values of 6.10 and 5.05. The 95% prediction interval for the September extent is 4.39 to 5.67 million sq km.

Note that extent loss this year has been somewhat atypical. Extent loss in June was low, followed by high extent loss in July. Because of the way that monthly extent is computed by the NSIDC, this may have resulted in an artificially high July extent, contributing to the model's prediction error. In that case, we should expect the August and September values to be below the current predictions. Even so, the observed values are expected to fall within the 95% prediction intervals.

The reported data for PIOMAS volume in May, June, and July is 23.02, 18.54, and 11.63. The fact that July was lower than predicted has the effect of pushing down the updated predictions, so the updated values for August and September are 7.18 and 5.89. The 95% prediction interval for the September volume is 4.83 to 6.83 thousand cubic kilometers.

These models are intended to be short range models. The differencing gives the models the character of a random walk, and the resulting variance of the prediction intervals grows rapidly over time periods greater than one year. The extent model has been shown to be reasonably accurate over short time periods, but no claims have been made about longer periods. The volume model has not been demonstrated to control the prediction error, even over periods of less than a year.

Throwing caution to the wind, we will use the models to predict the September extent and volume through 2019. The predictions include the mean prediction, the 95% prediction interval, the probability of setting a new record minimum, and the probability of falling below 1 million sq km (extent) or 1 thousand cu km (volume).

Extent

Year | Mean | 95% PI | Record | Less than 1000000 |

2015 | 5.05 | 4.38 - 5.68 | Less than 0.01% | Less than 0.01% |

2016 | 4.79 | 4.05 - 5.50 | 0.2% | Less than 0.01% |

2017 | 4.74 | 4.00 - 5.46 | 0.3% | Less than 0.01% |

2018 | 4.69 | 3.92 - 5.43 | 0.5% | Less than 0.01% |

2019 | 4.64 | 3.84 - 5.40 | 0.7% | Less than 0.01% |

Volume

Year | Mean | 95% PI | Record | Less than 1000 |

2015 | 5.87 | 4.61 - 7.01 | 0.2% | Less than 0.01% |

2016 | 5.24 | 2.80 - 7.49 | 10.4% | 0.04% |

2017 | 4.86 | 2.09 - 7.47 | 21.5% | 0.3% |

2018 | 4.52 | 1.58 - 7.41 | 30.5% | 1.0% |

2019 | 4.20 | 1.05 - 7.30 | 39% | 2.3% |

We should emphasize that these predictions should be viewed with a healthy amount of skepticism. Predictions at the longest time frames are far outside the model bounds, and the volume model has not been demonstrated to be reliable even over shorter time intervals. From a numerical standpoint, the mean values are reasonable, but the prediction intervals for the extent (and associated probabilities of setting a new minimum and falling below 1 million sq km) are suspiciously small.

On the other hand, quick observation of the historical extent data for September shows that a linear trend with annual slope of -53000 sq km and a prediction interval of plus or minus 1 million sq km captures nearly all of the observed data. (Notable exceptions include 2007 and 2012, but 2013 and 2014 are just below the trend.) On that basis, the above predictions may not be unreasonable.

Future Work

As discussed above, the current model for the NSIDC extent shows a correlation in the residuals at an interval of 11 months. The model can potentially be modified to account for this.

The model for the PIOMAS volume fails to have residuals with constant variance. Exploring other time series models (for example, in the ARCH family) which can represent changing variance will likely improve the prediction accuracy of the model, even if just by generating more realistic prediction intervals.

The models are deliberately designed to be simple, with minimal data input. Incorporating more data into the models could improve performance. Possibilities include daily data, regional breakdowns, and data from other sources. As a possible starting point, perhaps the data could be considered as a joint time series of both extent and volume.