1. Introduction
The rapid decline of summer Arctic sea ice over the satellite era (Fig. 1) has led to increased socioeconomic activity in the region and an emerging need for skillful predictions of sea ice conditions (Jung et al. 2016; Wagner et al. 2020). Following the then-record-setting 2007 September Arctic sea ice extent (SIE) minimum, a new research subfield emerged focused on scientific understanding of sea ice predictability and prediction. At the core of this research community has been the Sea Ice Outlook (SIO), which collects, analyzes, and synthesizes real-time seasonal predictions of September pan-Arctic SIE [Stroeve et al. (2014); see arcus.org/sipn/sea-ice-outlook]. From 2008 to the present, the SIO has collected predictions of September SIE initialized on 1 June, 1 July, and 1 August, months that span the summer Arctic melt season. The SIO began additionally collecting 1 September initialized predictions in 2021. The number of annual SIO submissions has grown steadily over time, with approximately 40 groups submitting predictions in recent years. These submissions are provided by an international community of polar scientists and employ a diverse mix of dynamical modeling, statistical, and heuristic approaches.
In parallel to the growth of the SIO, a body of work on sea ice predictability has been developed, which underpins the expectation that sea ice could be predictable on seasonal time scales. Coupled global climate models (GCMs) have been used to estimate the upper limits of sea ice predictability based on “perfect model” ensemble experiments, which quantify potential prediction skill in the case of perfectly known initial conditions, forcing, and model physics. These studies have shown that, with typical sample sizes, Arctic SIE potential predictability is statistically significant up to 12–36 months in advance (Koenigk and Mikolajewicz 2009; Blanchard-Wrigglesworth et al. 2011b; Holland et al. 2011; Tietsche et al. 2014; Day et al. 2014; Bushuk et al. 2019; Holland et al. 2019); however, they may overestimate nature’s true predictability limits due to the overly persistent SIE anomalies present in most modern GCMs (Blanchard-Wrigglesworth and Bushuk 2019; Giesse et al. 2021). The inherent predictability of Arctic sea ice is determined by a competition between the slowly evolving predictable components of the ice–ocean–land system and the comparatively unpredictable variability of the atmosphere (Tietsche et al. 2016). A number of physical mechanisms for summer Arctic SIE predictability have been demonstrated. These include the persistence and reemergence of SIE and sea ice concentration (SIC) anomalies (Blanchard-Wrigglesworth et al. 2011a; Bushuk and Giannakis 2015; Ordoñez et al. 2018; Giesse et al. 2021; Zhang et al. 2021), the persistence and advection of sea ice thickness (SIT) anomalies (Holland et al. 2011; Blanchard-Wrigglesworth et al. 2011b; Chevallier and Salas y Mélia 2012; Krumpen et al. 2013; Blanchard-Wrigglesworth and Bitz 2014; Day et al. 2014; Collow et al. 2015; Massonnet et al. 2015; Guemas et al. 2016; Williams et al. 2016; Blanchard-Wrigglesworth et al. 2017; Bushuk et al. 2017b; Dirkson et al. 2017; Blockley and Peterson 2018; Holland et al. 2019; Babb et al. 2019; Bonan et al. 2019; Brunette et al. 2019; Ponsoni et al. 2020; Babb et al. 2020; Balan-Sarojini et al. 2021), ocean heat transport and persistence of upper ocean heat content anomalies (Serreze et al. 2016; Lenetsky et al. 2021; Bushuk et al. 2022), melt onset and summer ice–albedo feedback processes (Schröder et al. 2014; Kapsch et al. 2014; Landy et al. 2015; Liu et al. 2015; Cox et al. 2016; Zhan and Davies 2017; Kwok et al. 2018; Bushuk et al. 2020), and summertime atmospheric circulation patterns (Ding et al. 2017, 2019; Baxter et al. 2019; Baxter and Ding 2022). Taken together, these studies have laid critical groundwork, showing that sea ice should be potentially predictable on seasonal time scales.
Have modern prediction systems capitalized upon this potential predictability and produced skillful predictions of observed Arctic sea ice? There is a tension in the sea ice prediction literature regarding this question. On the one hand, a number of studies have evaluated the performance of September SIE predictions submitted in real time to the SIO and found that these predictions have only a modest skill advantage relative to a baseline linear trend prediction (Stroeve et al. 2014; Blanchard-Wrigglesworth et al. 2015; Hamilton and Stroeve 2016; Lukovich et al. 2021; Blanchard-Wrigglesworth et al. 2023). The initial assessment performed by Stroeve et al. (2014) on SIO predictions submitted over the period of 2008–13 found that, regardless of the method, predictions struggled to capture years with large SIE anomalies relative to the linear trend. These initial findings have been largely corroborated over the longer assessment periods of 2008–15 and 2008–22 considered by Hamilton and Stroeve (2016) and Blanchard-Wrigglesworth et al. (2023), respectively. Blanchard-Wrigglesworth et al. (2023) found that the SIO multimodel median prediction has similar skill to a damped anomaly persistence forecast from the 1 July and 1 August initialization dates and is slightly more skillful than damped persistence from 1 June. They also found that individual models were generally less skillful than both the multimodel median and damped persistence.
On the other hand, there has been a recent proliferation of studies that document the development of seasonal prediction systems capable of skillfully predicting detrended September Arctic SIE anomalies. These skill assessments are based on retrospective seasonal predictions (also known as hindcasts or reforecasts), which use a fixed initialization and modeling formulation to make seasonal predictions of past observations using only data that would have been available at the time of initialization. Many dynamical prediction systems, which are based on initialized coupled dynamical models, have recently shown skillful seasonal predictions of detrended September Arctic SIE anomalies (Chevallier et al. 2013; Merryfield et al. 2013; Sigmond et al. 2013; Wang et al. 2013; Msadek et al. 2014; Collow et al. 2015; Peterson et al. 2015; Guemas et al. 2016; Sigmond et al. 2016; Bushuk et al. 2017a; Dirkson et al. 2017, 2019; Harnos et al. 2019; Kimmritz et al. 2019; Batté et al. 2020; Shu et al. 2021; Bushuk et al. 2022; Zhang et al. 2022; Martin et al. 2023). Simultaneously, many statistical prediction systems, which leverage empirical relationships in past observational data, have also demonstrated skillful detrended SIE predictions (Drobot et al. 2006; Lindsay et al. 2008; Kapsch et al. 2014; Schröder et al. 2014; Williams et al. 2016; Yuan et al. 2016; Serreze et al. 2016; Petty et al. 2017; Kondrashov et al. 2018; Brunette et al. 2019; Ionita et al. 2019; Walsh et al. 2019; Gregory et al. 2020; Andersson et al. 2021; Chi et al. 2021; Horvath et al. 2021). Both dynamical and statistical predictions (see section 2b ahead) have been shown to outperform the damped persistence forecast in most cases. This discrepancy between retrospective and real-time prediction skill represents a key tension in the sea ice prediction literature.
While many dynamical and statistical prediction systems have documented “skillful” SIE predictions, it is arguably more important to consider the quantitative level of skill and whether such predictions could provide value to end users (Murphy 1993). The sea ice prediction community gathered for a Sea Ice Outlook Contributors Forum in 2021 where this and many other issues were discussed (Steele et al. 2021). Many workshop attendees expressed a need to rigorously quantify the current state of the art across modern sea ice prediction systems. Unfortunately, this quantitative skill comparison is challenging due to differences in the evaluation time period and skill metrics considered across different studies and the relatively short period of real-time SIO predictions. This knowledge gap led to a key outcome of the SIO Forum—the expressed need for an “apples-to-apples” skill comparison of modern dynamical and statistical sea ice prediction systems. This community intercomparison of sea ice prediction skill forms the basis of the present study.
The outline for this paper is as follows. In section 2, we describe a retrospective prediction data request that was sent to the SIO contributor community, summarize the prediction methodologies used by the 35 groups who contributed predictions, and outline our methods for assessing prediction skill against multiple observational products. In section 3, we assess pan-Arctic September SIE prediction skill across dynamical and statistical models and consider whether SIE prediction skill has changed over time. In section 4, we consider smaller spatial scales, evaluating regional SIE prediction skill in five Arctic regions and comparing pan-Arctic and regional performance. Finally, we assess the prediction skill for local SIC and ice edge predictions in section 5. We discuss our findings in section 6, focusing on the key elements of successful sea ice prediction systems and the skill differences between retrospective and real-time predictions. Conclusions and a future outlook are presented in section 7.
2. Methods
a. Retrospective prediction data request.
To facilitate a direct apples-to-apples skill comparison of SIO models, a data request for retrospective predictions of September Arctic sea ice was sent to the SIO contributor community in early 2022. The data request was for retrospective predictions initialized on the SIO initialization dates of 1 June, 1 July, 1 August, and 1 September and spanning a minimum period of 2001–20. The requested target variables were September monthly mean pan-Arctic SIE, regional SIE, and gridded SIC fields. Pan-Arctic SIE is defined as the area of all Northern Hemisphere grid cells covered by at least 15% SIC. We define monthly mean SIE following the NSIDC sea ice index convention, which defines the monthly mean extent as the monthly mean of the daily SIE values. Regional SIE was requested for four regional domains: the Alaskan Seas (Chukchi and Beaufort), Siberian Seas (East Siberian and Laptev), Atlantic Seas (Kara, Barents, and Greenland), and Canadian Seas (Canadian Archipelago and Baffin Bay). These regions were defined based on a recently updated NSIDC region mask, which has better agreement with the regional definitions used by the International Hydrographic Organization (Meier and Stewart 2023; see map in Fig. 1d). We also later derived central Arctic regional SIE as the difference between the submitted pan-Arctic SIE and the sum of the submitted regional SIE values. SIO contributors were invited to submit any combination of the requested target variables and initialization dates along with metadata describing the design of their prediction system. We also requested submission of individual ensemble members, if applicable, and the initial SIC and SIT conditions used for dynamical predictions. Contributors were informed that the NSIDC sea ice index and SIC climate data record would be the official verification products, but we also utilize Ocean and Sea Ice Satellite Application Facility (OSI SAF) observations for verification in this study (see section 2c ahead). For groups that only provided SIC predictions, pan-Arctic and regional SIE were computed on the native model grid and postprocessing was applied to remove biases (see ahead), including those related to land–sea mask differences.
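For concreteness, the following is a minimal Python sketch of the pan-Arctic SIE conventions described above (the function and argument names are illustrative, not the processing code used in this study):

```python
import numpy as np

def pan_arctic_sie(daily_sic, cell_area, threshold=0.15):
    """September pan-Arctic SIE following the conventions described above.

    daily_sic : array (n_days, ny, nx), fractional SIC for each day of the month
    cell_area : array (ny, nx), grid cell areas (km^2)

    Returns the monthly mean SIE in million km^2. Per the NSIDC sea ice index
    convention, extent is computed for each day and then averaged, rather than
    computing the extent of the monthly mean SIC field.
    """
    daily_extent = np.array([
        cell_area[day >= threshold].sum()  # total area of cells with >= 15% SIC
        for day in daily_sic
    ])
    return daily_extent.mean() / 1e6
```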
Retrospective prediction contributions were received from 17 statistical models, 17 dynamical models, and 1 heuristic prediction (see summary of submitted data in Table 1). These contributions span 11 countries across Europe, Asia, and North America and provide a total of 2807 individual predictions of September pan-Arctic SIE (1267 statistical; 1526 dynamical; 14 heuristic). All data have subsequently been converted to a common format and made publicly available via an online repository (https://zenodo.org/doi/10.5281/zenodo.10124346). The online repository also contains scripts for processing the raw data, computing skill metrics, and producing all figures for this study. This is the most comprehensive dataset of multimodel Arctic sea ice predictions assembled to date and is intended to provide an open community resource for future sea ice prediction research. In this study, we focus on ensemble-mean sea ice predictions, since these are the primary focus of the SIO, place ensemble and deterministic contributions on a common footing, and allow the largest set of models to be compared.
Summary of submitted retrospective prediction data. Target variables are pan-Arctic SIE (P), regional SIE (R), SIC (S), and the number of ensemble members (e), indicated in parentheses. The variables that are bias corrected are shown in parentheses in the bias-correction column.
b. Statistical and dynamical prediction systems.
The submitted predictions can be grouped into two main categories—dynamical and statistical predictions. Dynamical predictions are based on numerical dynamical models that are initialized from observationally constrained initial conditions and integrated forward in time. Statistical predictions are based on empirical predictor–predictand relationships and are trained using past observational or reanalysis data. It should also be noted that the distinction between dynamical and statistical methods is not strict: for example, many dynamical models use statistical postprocessing techniques to bias correct their predictions, and many statistical models are trained on reanalysis-based predictor data. There is also one submitted “heuristic” prediction from the NCAR/University of Colorado sea ice pool. This office pool collects September SIE predictions each summer on 1 June from NCAR and University of Colorado (CU) scientists and serves as a useful “human expert assessment” baseline against which to compare the skill of dynamical and statistical models (Hamilton et al. 2014).
Table 2 summarizes the dynamical prediction systems, which come in three main varieties: fully coupled global models, fully coupled regional models driven by specified lateral boundary conditions, and ice–ocean models driven by specified atmospheric forcing. Fully coupled global models are the most common formulation, likely because many centers have carefully developed these models for climate modeling applications. Regional models offer the advantage of substantial computational savings, allowing for Arctic simulations at higher resolution, but come with the additional challenges of requiring high-quality boundary conditions and significant research investment in model development. The ice–ocean models that use specified atmospheric forcing are driven either using atmospheric fields from another prediction system or using reanalysis atmospheric fields from previous years. The nominal horizontal ice–ocean resolution of the global dynamical models ranges from 0.25° to 2.8°, whereas the two submitted regional models have nominal resolutions of 0.08° and 0.3°. The horizontal atmospheric resolutions employed range from 0.4° to 2.8°. Most of the dynamical prediction systems incorporate observations of SIC (11 of 17 systems), sea surface temperature (SST; 14 systems), ocean temperature and salinity (T/S) profiles (13 systems), and reanalysis atmospheric data (15 systems) into their initialization procedure. A number of systems also initialize their models using observed sea level anomaly (SLA) data (four systems) and SIT data (two systems). A variety of different data assimilation techniques are employed, including 3DVAR, 4DVAR, strongly and weakly coupled ensemble Kalman filters (EnKFs), nudging, optimal interpolation, and reanalysis-forced ice–ocean runs. Most of the dynamical models are ensemble prediction systems, and their deterministic SIO prediction is taken as the ensemble mean. Note that for SIE predictions, SIE is first computed for each ensemble member and then averaged to form the ensemble mean, as sketched below.
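Because extent is a thresholded, and hence nonlinear, functional of SIC, the order of operations matters: averaging SIC across members first would smear the ice edge and change the result. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def ensemble_mean_sie(member_sic, cell_area, threshold=0.15):
    """Ensemble-mean SIE: extent is computed per member, then averaged.

    member_sic : array (n_members, ny, nx) of SIC fields
    cell_area  : array (ny, nx) of grid cell areas (km^2)
    """
    per_member = np.array([
        cell_area[m >= threshold].sum() for m in member_sic
    ])
    # Million km^2; differs in general from the extent of the
    # ensemble-mean SIC field because of the 15% threshold.
    return per_member.mean() / 1e6
```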
Summary of dynamical prediction models. Acronyms used are ECMWF Reanalysis (ERA), Integrated Forecasting System (IFS), Ocean Reanalysis System (ORAS), Operational Ocean Analysis System version 5 (OCEAN5), National Centers for Environmental Prediction (NCEP), Climate Forecast System Reanalysis (CFSR), Modern-Era Retrospective Analysis for Research and Applications (MERRA), Japanese Reanalysis (JRA), Climate Prediction Center (CPC), numerical weather prediction (NWP), Forecast Ocean Assimilation Model (FOAM), HadISST2 combined with Canadian Ice Service Charts (Had2CIS), CCCma Coupled Climate Model version 4 with ice initialization (CanCM4i), Global Environmental Multiscale model and Nucleus for European Modelling of the Ocean (GEM-NEMO), and NEMO three-dimensional variational ocean data assimilation (NEMOVAR).
The methodologies of each statistical prediction system are summarized in Table 3. A variety of methods are employed, ranging from standard statistical techniques such as linear regression, multiple regression, and autoregressive models to more complex methods including convolutional neural networks, Gaussian process regression, multivariate linear Markov models, long short-term memory networks, and harmonic decomposition. Most models include a sea ice predictor variable—typically SIE or SIC—and some models also include thermodynamic ocean variables and dynamic and thermodynamic atmospheric variables as predictors (see Table 3, column 2). The reader is reminded that the predictand variables are provided in Table 1, column 5. All submitted statistical models are trained using past data only. Some prediction systems specify a designated training period (e.g., 1979–2000) and use a fixed statistical model to predict all future years (e.g., 2001–21). Other systems retrain their model each successive year using all available past data (e.g., predict 2001 based on 1979–2000 data and predict 2002 based on 1979–2001 data). As such, we are unable to disentangle the skill contribution of the statistical approach itself from that of other aspects of the forecast system (e.g., the choice of training data).
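As an illustration of the second training protocol, the following is a minimal sketch of expanding-window retraining with ordinary least squares (the linear-regression choice and all names are illustrative assumptions; the individual systems differ considerably):

```python
import numpy as np

def expanding_window_predictions(years, X, y, first_test_year=2001):
    """Retrain each year on all available past data, then predict that year.

    years : array (n,) of calendar years
    X     : array (n, p) of predictor values aligned with years
    y     : array (n,) of observed September SIE
    """
    years = np.asarray(years)
    Xd = np.column_stack([np.ones(len(years)), X])  # add intercept column
    preds = {}
    for i, yr in enumerate(years):
        if yr < first_test_year:
            continue
        train = years < yr  # strictly past years only: no look-ahead
        beta, *_ = np.linalg.lstsq(Xd[train], y[train], rcond=None)
        preds[yr] = Xd[i] @ beta
    return preds
```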
Summary of statistical prediction models. Acronyms used for training/initialization data are sea ice velocity (SIU), melt pond area (MPA), ocean heat content (OHC), ocean temperature (OT), 2-m air temperature (SAT), downwelling longwave radiation (LWDN), downwelling shortwave radiation (SWDN), net surface heat flux (NSHF), sea level pressure (SLP), surface pressure (PS), geopotential height (Z), surface wind (USURF/VSURF), winds at geopotential height level (UZ/VZ), specific humidity (q), rain rate (RR), snowfall rate (SR), precipitable water content (PWC), Icelandic low (IL), and Arctic Oscillation (AO).
Many of the systems postprocess their predictions in order to correct systematic biases present in their retrospective predictions (see Table 1). The bias-correction methods employed are relatively simple, such as correction of the mean bias, correction of the trend, or a linear regression adjustment. Some systems bias correct their SIE time series directly, whereas others correct the SIC spatial fields. We note that some bias-correction methods require computing anomalies relative to a climatology, which may implicitly incorporate future data. This is a standard approach for retrospective prediction assessment but may artificially increase prediction skill (Risbey et al. 2021).
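For concreteness, a minimal sketch of a trend correction is shown below (a mean-only correction is the zero-slope special case; names are illustrative):

```python
import numpy as np

def trend_bias_correct(raw_pred, obs, years):
    """Remove a linear-in-time systematic error from retrospective predictions.

    Fits the forecast error (raw_pred - obs) as a linear function of year and
    subtracts the fit. Note that fitting over the full period draws on future
    data relative to each individual forecast, the caveat raised above.
    """
    years = np.asarray(years, dtype=float)
    slope, intercept = np.polyfit(years, raw_pred - obs, deg=1)
    return raw_pred - (slope * years + intercept)
```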
c. Observational verification.
Consistent with the SIO evaluation, we verify pan-Arctic SIE predictions against the NSIDC sea ice index, version 3 (Fetterer et al. 2017), which is based on the NASA Team retrieval algorithm. We also verify pan-Arctic SIE predictions against the OSI SAF sea ice index, version 2.1 (OSI-420), which uses the Bristol/Bootstrap retrieval algorithm (Lavergne et al. 2019). SIC predictions are verified against the NOAA/NSIDC climate data record (CDR) of SIC, version 4 (G02202; Meier et al. 2021) and the OSI SAF SIC CDR, release 3 (OSI-450a; EUMETSAT Ocean and Sea Ice Satellite Application Facility 2022). Both of these products use a spatial interpolation to gap-fill the polar observational hole. We also use the NSIDC and OSI SAF CDR SIC data to compute regional SIE using the recently updated NSIDC Arctic region mask (Meier and Stewart 2023). We perform all SIC analyses on the 25-km NSIDC polar stereographic north grid and regrid each model’s SIC data to the NSIDC grid using bilinear interpolation and NSIDC’s CDR land–sea mask. In cases where the model land–sea boundary lies within the NSIDC ocean domain, nearest neighbor extrapolation to the NSIDC grid is used.
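A hedged sketch of this regridding step is given below, using SciPy’s triangulation-based linear interpolation as a stand-in for true bilinear interpolation; all coordinates are assumed to share a common projected (x, y) space, and the names are illustrative:

```python
import numpy as np
from scipy.interpolate import griddata

def regrid_to_nsidc(model_x, model_y, model_sic, nsidc_x, nsidc_y, nsidc_ocean):
    """Interpolate model SIC to the NSIDC 25-km grid, with nearest-neighbor
    fill where the model land-sea boundary leaves NSIDC ocean cells uncovered.

    nsidc_ocean : boolean mask of NSIDC ocean points (the CDR land-sea mask).
    """
    pts = np.column_stack([model_x.ravel(), model_y.ravel()])
    vals = model_sic.ravel()
    valid = np.isfinite(vals)  # drop model land points
    target = np.column_stack([nsidc_x.ravel(), nsidc_y.ravel()])
    sic = griddata(pts[valid], vals[valid], target, method="linear")
    # Nearest-neighbor extrapolation for NSIDC ocean cells left unfilled
    gaps = np.isnan(sic) & nsidc_ocean.ravel()
    if gaps.any():
        sic[gaps] = griddata(pts[valid], vals[valid], target[gaps],
                             method="nearest")
    return np.where(nsidc_ocean.ravel(), sic, np.nan).reshape(nsidc_x.shape)
```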
d. Skill metrics.
Note that, unlike RMSE, the detrended RMSE has no contribution from mean bias, since this bias is subtracted off during the detrending procedure, but it does have contributions from conditional biases (predicting the incorrect amplitude of anomalies). Another commonly used metric in the sea ice prediction literature is the mean-squared error skill score (MSESS), which is connected to the ACC and RMSE via the decomposition of Murphy (1988). In particular, the squared ACC skill provides an upper bound on the MSESS and can be interpreted as the variance explained by a regression-adjusted forecast that is free of conditional and mean biases.
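For reference, the Murphy (1988) decomposition referred to above can be written as follows (standard notation, assuming a climatological-mean reference forecast; $f_t$ and $o_t$ denote predicted and observed SIE with means $\bar{f}$, $\bar{o}$, standard deviations $s_f$, $s_o$, and correlation $r$ over the verification period):

```latex
\mathrm{MSESS} = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\mathrm{clim}}}
              = r^{2}
                - \left( r - \frac{s_f}{s_o} \right)^{2}
                - \left( \frac{\bar{f} - \bar{o}}{s_o} \right)^{2}.
```

The second and third terms are the conditional and mean bias penalties, respectively; both vanish for a regression-adjusted forecast, leaving MSESS = r², the squared-ACC upper bound noted above. Applied to detrended quantities, the mean bias term is zero by construction, consistent with the detrended RMSE property described above.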
To facilitate an apples-to-apples skill comparison, we focus most of our analysis on the 2001–20 time period, which is the period with the most submitted predictions (see Table 1). Note that some models were only able to submit predictions for a portion of this time period, which may bias their skill results. Specifically, 24 models submitted predictions for the full 2001–20 period and 31 models submitted at least 14 years of predictions. We also include figures in the supplemental material showing prediction skill metrics computed over the full time period submitted by each model (Figs. S5 and S6 in the online supplemental material). We emphasize that the overall conclusions of the study are unchanged if the full time period is used for computing skill.
e. Reference and multimodel predictions.
We also compute a multimodel median prediction, which is the median predicted value across all models for each year and each lead time. The multimodel median prediction is only computed for years with at least 10 models available (years 1993–2021), in order to reduce the impact of sampling bias.
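The following minimal sketch illustrates the multimodel median rule, together with one common formulation of a damped anomaly persistence reference forecast (the damping of the initialization-month anomaly by its observed lagged correlation with September is an illustrative assumption, not necessarily the exact benchmark recipe used here):

```python
import numpy as np

def damped_persistence(init_anomaly, lagged_corr, september_clim):
    """Damped anomaly persistence: damp the initialization-month SIE anomaly
    by the observed lagged correlation with September, then add the September
    climatology."""
    return september_clim + lagged_corr * init_anomaly

def multimodel_median(model_preds, min_models=10):
    """Median across available models for a given year and lead time. Returns
    NaN when fewer than min_models predictions are available, matching the
    availability threshold described above."""
    preds = np.asarray(model_preds, dtype=float)
    if np.isfinite(preds).sum() < min_models:
        return np.nan
    return np.nanmedian(preds)
```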
3. Pan-Arctic predictions
a. September pan-Arctic SIE prediction skill.
We begin by assessing the ability of models to predict September pan-Arctic SIE, which is the flagship prediction target of the SIO. Figure 2 shows the time series of NSIDC observed September SIE (black) and multimodel median predictions (red) from initialization dates of 1 June–1 September. The red shading indicates the interquartile range (middle 50%) of individual model ensemble mean predictions. We find that the multimodel median prediction has high skill across SIO lead times, capturing both the observed SIE trend and interannual variations over the period 1993–2021. The ACC values, which include a substantial trend contribution, are greater than 0.9 for all lead times, whereas the detrended ACC values range from 0.66 to 0.97. The RMSE values of the multimodel median prediction are substantially smaller than the observed detrended standard deviation (0.54 million km2), indicating prediction skill relative to the trend climatology prediction. We find that the multimodel predictions become more confident (decreased intermodel spread) as the lead time decreases and also capture SIE anomalies with greater skill. For example, better predictions of the extreme 1996, 2007, and 2012 SIE anomalies are made from 1 July than from 1 June, and further improvements are seen in the 1 August and 1 September forecasts. We note that the retrospective skill of the multimodel median prediction is considerably higher than the skill of multimodel median real-time predictions submitted to the SIO (see Fig. S1). We return to this point in the discussion section (section 6b). Versions of Fig. 2 for each individual model submitted can be viewed on GitHub (https://github.com/MitchBushuk/SIO_review_paper).
Next, we take a more granular view and explore the prediction skill of individual models. Figures 3 and 4 show the prediction skill of the dynamical and statistical models, respectively. We find that the majority of dynamical and statistical models are skillful at SIO lead times, outperforming the trend climatology prediction (dashed gray line). The models also generally outperform damped persistence (solid gray line) from 1 June and 1 July, whereas damped persistence provides a more challenging benchmark from 1 August and 1 September, with about half the models beating damped persistence from 1 August and most models falling short of it from 1 September. While there is a large spread in skill across models, we find that the majority of models have detrended ACC values that exceed 0.4 from 1 June and 0.5 from 1 July onward, the latter being a commonly used practical threshold for useful forecast skill. The fact that this broad set of models, which employ diverse prediction methodologies and input datasets, is generally skillful at SIO lead times shows that useful real-time multimonth predictions of September sea ice should be achievable.
The very high skill of damped persistence from 1 September (detrended ACC of 0.98) indicates that interannual fluctuations of September-mean SIE are essentially “locked in” by 1 September. This high skill demonstrates that the key source of predictability from 1 September is the multiweek persistence of SIE anomalies, which have particularly high persistence values at the time of the summer minimum (Blanchard-Wrigglesworth et al. 2011a). Since these SIE anomalies are observable in near-real time, dynamical prediction systems should, in principle, be able to initialize predictions using these data and capture this source of predictability. However, we find that the majority of dynamical models are less skillful than damped persistence from 1 September, which indicates that they are making errors in their sea ice initial conditions and/or have substantial short-term forecast drift that is not adequately removed by postprocessing. The most skillful dynamical models from 1 September are comparable to the damped persistence benchmark, suggesting that these systems are successfully assimilating sea ice concentration or other related observations. Similarly, the most skillful statistical models are comparable to damped persistence from 1 September, yet most statistical models have lower skill than this benchmark despite, in principle, having access to the same SIE observations as used by the damped persistence forecast. This lower skill likely results from a combination of factors, such as some models using monthly rather than daily data and some models including other predictor variables besides SIE, which may degrade 1 September skill in favor of higher skill at longer lead times. We also note that training and verifying the damped persistence forecast on different datasets can provide a useful measure of observational uncertainty. We find detrended ACC values of 0.96, 0.95, and 0.95 based on training/verification pairs of NSIDC/OSI SAF, OSI SAF/NSIDC, and OSI SAF/OSI SAF, respectively, which are slightly lower than the value of 0.98 for NSIDC/NSIDC reported above.
Moving to longer lead times, we find that slightly more than half the models outperform damped persistence from 1 August and nearly all the models outperform damped persistence from 1 June and 1 July. This indicates that the models are successfully capturing other sources of predictability at these lead times, potentially including SIT anomaly persistence, surface albedo anomalies and ice–albedo feedback, surface air temperature anomalies, and atmospheric circulation patterns. Taken as a whole, the pan-Arctic skill of the dynamical and statistical models is broadly similar; however, the model spread precludes definitive statements on which class of method is preferable for pan-Arctic predictions.
The multimodel median prediction has high skill, with detrended ACC values exceeding 0.75 for all SIO lead times. The multimodel median skill is higher than nearly all individual models, suggesting that this prediction benefits from cancellation of random errors across prediction systems, which is a common finding across a variety of prediction applications including the SIO (e.g., Hagedorn et al. 2005; DelSole et al. 2014; Harnos et al. 2019; Blanchard-Wrigglesworth et al. 2023) as well as the Southern Ocean counterpart of the SIO, the Sea Ice Prediction Network (SIPN)-South ensemble (Massonnet et al. 2023). We also note that the skill of a multimodel median prediction based only on dynamical models is similar to the skill of the multimodel median based on all models, whereas the median prediction based only on statistical models has lower skill.
Of the dynamical models, ECMWF’s fifth-generation seasonal forecast system (SEAS5) stands out as having particularly high pan-Arctic prediction skill, achieving skill comparable to the multimodel median. There are also two statistical models that are high-skill outliers: the Alfred Wegener Institute (AWI) model, which employs a multiple regression based on stability maps, and the Korea Polar Research Institute (KOPRI) model, which uses a convolutional long short-term memory model. We note that the skill levels of the AWI and KOPRI models are roughly equal to the upper limit of pan-Arctic SIE predictability as estimated by perfect model GCM experiments [cf. the 1 July initialized forecast skill in Fig. 1 of Tietsche et al. (2014)]. We return to the possible sources of prediction skill across the individual systems in section 6a.
We also verify the predictions using the OSI SAF sea ice index (see Figs. S3 and S4). The OSI SAF sea ice index has a higher mean value than the NSIDC sea ice index (see Fig. 1a), but the indices otherwise agree closely, with an ACC of 1.00, a detrended ACC of 0.98, and a detrended RMSE of 0.10 million km2. Consistent with this close agreement, we find that the skill values are not sensitive to the choice of verification product and that the choice of verification product does not affect the qualitative conclusions regarding pan-Arctic skill. The main difference between the NSIDC- and OSI SAF-verified skill metrics occurs for the RMSE, since this metric is affected by the mean offset between the products, whereas the other skill metrics are not.
The heuristic prediction submitted from the NCAR/CU sea ice pool provides a useful human expert assessment baseline for pan-Arctic SIE prediction skill. We find that this 1 June heuristic prediction has no skill (ACC = −0.18; detrended ACC = −0.39) over its submission period of 2008–21 [see Hamilton et al. (2014) and a more recent figure at https://bit.ly/3MscjmL], emphasizing the inherent challenges of human-based assessments.
b. Is prediction skill changing over time?
Earlier theoretical work has shown that Arctic sea ice predictability is dependent on the mean climate state (Holland and Stroeve 2011; Holland et al. 2011; Cheng et al. 2016; Holland et al. 2019). While some have argued that the recently observed trends toward a thinner and more mobile ice pack may reduce inherent summer sea ice predictability, Holland et al. (2019) show that changes in sea ice predictability are highly nonmonotonic under climate change, with predictability in the CESM1 model actually reaching a local maximum in the 2010s. We can use the multimodel retrospective prediction dataset to investigate this question by analyzing the evolution of prediction errors over time.
Figure 5a shows the multimodel mean of single-model detrended pan-Arctic SIE absolute errors plotted as a function of time (horizontal axis) and initialization date (colors). We find that the error time series do not display clearly identifiable trends but are punctuated by large errors in the extreme sea ice years of 1996, 2007, and 2012, which, respectively, had high, low, and low sea ice extents (Kay et al. 2008; Serreze and Stroeve 2015; Zhang et al. 2013). The trends in prediction errors are not significantly different from zero (at the 95% confidence level) for any initialization month. This finding suggests that there has not been a detectable change in sea ice predictability since 1990.
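A minimal sketch of this kind of trend test is given below (ordinary least squares on the multimodel-mean absolute errors; treating the residuals as independent is a simplifying assumption):

```python
import numpy as np
from scipy.stats import linregress

def error_trend_is_significant(years, abs_errors, alpha=0.05):
    """Regress absolute prediction error on year and test the slope against
    zero at the (1 - alpha) two-sided confidence level."""
    result = linregress(np.asarray(years), np.asarray(abs_errors))
    return result.slope, result.pvalue, result.pvalue < alpha
```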
As expected, we find that the SIE errors increase with lead time, but the error reduction between lead times changes from year to year. For example, 2005 has similar errors across lead times, 2007 shows a similar reduction for each successive initialization month, and 2012 shows large reductions from June to July and from August to September, but little change from July to August. These differences are likely related to the particular synoptic conditions of each summer. For example, the 1 August error is particularly large in 2012, likely because neither the great Arctic cyclone, which peaked on 6 August (Simmonds and Rudeva 2012) and led to rapid sea ice loss that month, nor its impact on sea ice was predicted by seasonal prediction systems (Yamagami et al. 2018).
Earlier work has shown that sea ice predictions typically struggle in “hard to predict years” with large SIE anomalies (Stroeve et al. 2014), sometimes related to atmospheric conditions such as late-summer cyclones (Lukovich et al. 2021; Finocchio et al. 2022), although theoretical work indicates that some extreme years exhibit seasonal predictability (Tietsche et al. 2013). These large errors in extreme SIE years have been characterized as a major shortcoming of sea ice prediction systems. However, Fig. 5a shows that the linear trend prediction makes much larger errors in these years than the prediction systems do (cf. dashed gray line to colored lines). Figures 5b and 5c show the skill improvement of the model-based predictions relative to the trend climatology and damped persistence predictions, respectively, with positive values indicating error reductions. We find that the time-mean error reductions are generally positive, indicating that the prediction systems typically provide better skill than the reference forecasts, with the exception of the 1 September damped persistence forecast. Moreover, the extreme SIE years of 1996, 2007, and 2012 stand out as the years in which the prediction systems provide the largest skill improvements over the linear trend prediction. This challenges the typical interpretation that prediction systems “failed” in these extreme SIE years. Rather, it is precisely in these extreme years that the prediction systems provide the most added value relative to basic reference forecasts.
We next investigate the error characteristics of individual model predictions in Fig. 6. Figure 6a shows the prediction errors from individual models and target years plotted against the observed detrended SIE anomalies in those years. In low SIE years, the models generally overpredict the observed SIE (positive errors), and in high SIE years, they generally underpredict it (negative errors). The distribution of errors (Fig. 6b) is relatively symmetric about zero for all initialization times, suggesting that high and low SIE anomalies are similarly difficult to predict. Q–Q plots reveal that the error distributions for all initialization times have symmetric heavy tails compared with a Gaussian distribution, suggestive of outlier models with large errors (not shown). The linear fits to the prediction errors in Fig. 6a (colored lines) have decreasing slopes as the initialization date approaches September and are bracketed by the 1:1 line (a no-skill prediction) and the y = 0 line (a perfect prediction). If September SIE were entirely unpredictable, we would expect the errors to lie on the 1:1 line, whereas if it were perfectly predictable, we would expect the errors to lie on the y = 0 line. Thus, the decreasing slopes as the initialization date approaches September show that inherent SIE predictability increases as the lead time decreases. We also find that the prediction error distributions become progressively more peaked around zero as the lead time decreases (Fig. 6b).
4. Regional predictions
a. September regional SIE prediction skill.
The prediction systems skillfully predict pan-Arctic SIE, but how do they perform on the regional and local scales that users ultimately require? In Figs. 7 and 8, we plot the detrended regional SIE skill for the dynamical and statistical models, respectively, in the five regional domains shown in Fig. 1d. The skill metrics for full regional SIE time series are shown in Figs. S9 and S10.
We find that both dynamical and statistical models have detrended regional skill, but the level of skill is regionally variable. The highest skill is found in the Alaskan and Siberian sectors, in which the multimodel median detrended ACC exceeds 0.75 at SIO lead times. Unlike the pan-Arctic skill results, there is a notable difference between dynamical and statistical model performance in these regions (cf. panels a–d of Figs. 7 and 8). Taken as a whole, the dynamical models outperform the statistical models in the Alaskan and Siberian regions; however, the KOPRI statistical model has high skill in both regions at a level comparable to the most skillful dynamical models. The dynamical models also outperform the statistical models in the central Arctic domain (cf. panels i,j in Figs. 7 and 8), whereas the skill differences are more modest in the Canadian and Atlantic regions (panels e–h in Figs. 7 and 8). Interestingly, the superior regional SIE skill of dynamical models does not clearly translate into better pan-Arctic skill relative to statistical models.
The model skill is lowest in the Atlantic region for both dynamical and statistical models. This is likely because Atlantic September SIE variations result from SIE variability occurring in the northern portions of the Greenland, Barents, and Kara Seas, which is driven by anomalies in sea ice export that are challenging to predict (Kwok 2008). The Canadian Archipelago is also well known as a difficult-to-predict region due to its complex network of channels and straits. Encouragingly, the majority of statistical and dynamical models show detrended prediction skill in this region, albeit at a generally lower level than the dynamical models achieve in the Alaskan and Siberian sectors. Of the dynamical models, the Regional Arctic System Model (RASM) has high skill in the Canadian region, potentially related to its relatively high horizontal resolution compared to other systems. This higher resolution provides both a more accurate representation of complex land geometry and a more realistic representation of sea ice dynamic and thermodynamic processes. The skill in the central Arctic domain is the second lowest, after the Atlantic. The central Arctic SIE time series is dominated by large anomalies in 2007, 2012, and 2020 (Fig. 1b), suggesting that the low skill reflects a general failure to capture the central Arctic anomalies in these years.
Relative to the damped persistence benchmark, the models perform quite well for regional SIE. As with pan-Arctic SIE, damped persistence of regional SIE is highly skillful for 1 September forecasts and provides a stringent benchmark that most dynamical and statistical models fail to beat. The models perform more favorably at longer lead times. In the Alaskan, Siberian, and Canadian regions, the majority of models outperform damped persistence from the 1 June, 1 July, and 1 August initialization dates. In the central Arctic, most models beat damped persistence from 1 June and 1 July. In the Atlantic sector, the models are notably less skillful than damped persistence from 1 August, suggesting a deficiency in the models’ representation of summertime Atlantic SIE. These regional skill results are insensitive to the verification product—the same conclusions hold if OSI SAF observations are used for verification (see Figs. S11 and S12).
b. Relation between pan-Arctic and regional skill.
Are models more skillful at predicting pan-Arctic or regional SIE? Do models with high pan-Arctic skill also have high regional skill? We investigate these questions in Fig. 9, which plots regional versus pan-Arctic detrended ACC for each model, colored by lead time. In most regions, the majority of predictions lie below the 1:1 line, indicating that regional SIE skill is generally lower than pan-Arctic skill. The Alaskan region is the most skillfully predicted region, with 46% of predictions lying above the 1:1 line. The damped persistence prediction also lies above the 1:1 line (square markers), indicating that the Alaskan region may have high inherent predictability. The Siberian and Canadian regions are also predicted fairly well, with 37% and 32% of predictions exceeding pan-Arctic skill, respectively. The performance is notably worse in the Atlantic and central sectors, each of which has only 12% of predictions exceeding pan-Arctic skill. We also find that the regional skill differences across models are related to their pan-Arctic skill differences. For example, the R² values between regional and pan-Arctic detrended ACC are 0.59, 0.48, and 0.49 in the Alaskan, Siberian, and central regions, respectively. Regional skill is more decoupled from pan-Arctic skill in the Canadian and Atlantic regions, with R² values of 0.25 and 0.05, respectively.
5. Sea ice concentration predictions
September SIC prediction skill.
Finally, we consider the ability of models to predict September sea ice variations on the local scale. Figures 10 and 11 show the SIC skill metrics for the dynamical and statistical models that submitted SIC predictions, respectively. These metrics are first computed locally and then area-averaged over the zone of September SIC variability, defined as all grid points where the September SIC standard deviation exceeds 10% (see Fig. 1c); a sketch of this aggregation is given below. The gap between full and detrended SIC skill is quite small, consistent with the fact that observed SIC variability is dominated by interannual rather than trend-based variance (84% and 16% of the total variance, respectively). Compared to the skill levels for pan-Arctic and regional SIE, the SIC skill scores are lower, consistent with a larger role for unpredictable local-scale dynamics and the fact that, unlike SIE, SIC predictions do not benefit from error compensation (i.e., the cancellation of over- and underestimations). This lower predictability is also reflected by the damped persistence forecast, which is skillful from 1 September but drops off quite rapidly for earlier initialization dates. Interestingly, a handful of models [ECMWF SEAS5, CPC CFSv2, Environment and Climate Change Canada-Canadian Seasonal to Interannual Prediction System (ECCC-CanSIPSv2), First Institute of Oceanography-Earth System Model (FIO-ESM), GFDL Seamless System for Prediction and Earth System Research with ice data assimilation (GFDL-SPEAR-IDA), Pan-Arctic Ice Ocean Modeling and Assimilation System-CFS (PIOMAS-CFS), and Nico Sun] outperform damped persistence from 1 September, which was not the case for pan-Arctic or regional SIE. This suggests that some models extract additional skill from skillful prediction of the early September atmospheric state and the corresponding local SIC response.
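The following minimal sketch uses the full (non-detrended) ACC for brevity; the detrended version applies the same area average to detrended anomalies, and all names are illustrative:

```python
import numpy as np

def area_mean_local_acc(pred_sic, obs_sic, cell_area, std_threshold=0.10):
    """Compute ACC at each grid point, then area-average over the zone of
    September SIC variability (grid points with SIC std dev > 10%).

    pred_sic, obs_sic : arrays (n_years, ny, nx), fractional SIC
    cell_area         : array (ny, nx), grid cell areas (km^2)
    """
    pa = pred_sic - pred_sic.mean(axis=0)  # anomalies about the time mean
    oa = obs_sic - obs_sic.mean(axis=0)
    num = (pa * oa).sum(axis=0)
    den = np.sqrt((pa ** 2).sum(axis=0) * (oa ** 2).sum(axis=0))
    acc = np.divide(num, den, out=np.full(num.shape, np.nan), where=den > 0)
    mask = obs_sic.std(axis=0) > std_threshold  # zone of SIC variability
    w = np.where(mask & np.isfinite(acc), cell_area, 0.0)
    return (np.where(w > 0, acc, 0.0) * w).sum() / w.sum()
```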