1. Introduction
Many of the physical processes involved in weather or climate simulations occur at spatial scales smaller than the grid spacings of numerical models. These processes, such as radiative transfer, cumulus convection, and turbulent mixing, can drive heat and momentum budgets and therefore are critical to a numerical model's predictive skill (Bauer et al. 2015). Because explicitly resolving these processes is computationally infeasible, we need parameterization schemes that approximate subgrid-scale physical processes using resolvable variables and incorporate the resulting tendencies into the numerical integration. Traditional turbulence parameterizations are built on heuristic assumptions whose quality and reliability are open to question (Chung and Matheou 2014). First, the eddy viscosity assumption postulates that resolved scalar variances and kinetic energy are always transferred to smaller unresolved scales, but in reality, backscatter can occur at all scales and transfer variance and energy in the opposite direction (Chow et al. 2019). Second, when traditional parameterization schemes are developed, the processes to be parameterized are sometimes evaluated in isolation, without allowing a scheme to interact with other processes or the dynamical core (Donahue and Caldwell 2018). Such an artificial separation is also a source of numerical errors and uncertainties (Gross et al. 2018).
In recent years, the development of deep learning (DL) has opened up new opportunities for weather and climate research. Some efforts have investigated the feasibility of replacing the entire numerical model with a machine learning (ML) model. For example, MetNet-2, a large-context neural network, outperformed the state-of-the-art, physics-based High-Resolution Ensemble Forecast (HREF) for precipitation forecasts up to 12 h of lead time (Espeholt et al. 2022). However, on short- and medium-range and climate-change time scales, where physical principles must be well represented, pure ML methods still fall behind numerical models owing to a lack of physical consistency (Weyn et al. 2019). Therefore, using DL methods to augment numerical models appears more attractive and, for now, holds greater potential.
Deep learning–based parameterization is one such application and has gained much attention recently. DL parameterizations can be developed in an a priori setting, that is, by learning the map between resolved-scale variables and the tendencies due to subgrid-scale processes directly, with training inputs and targets computed from high-resolution benchmark simulations. This approach has been applied to improve the representation of turbulent fluxes in the surface layer (Leufen and Schädler 2019), gravity wave drag (Chantry et al. 2021), moist processes, and collections of subgrid-scale processes (Gentine et al. 2018; Brenowitz et al. 2020). These schemes performed well when evaluated offline; however, in online testing, when they are incorporated into numerical models, issues arise, such as numerical instability and mean-state drift (Brenowitz et al. 2020) and violation of conservation laws (Gentine et al. 2018). Enforcing physical constraints in the DL architecture can improve this situation (Beucler et al. 2021), although such constraints can trigger model instabilities in some cases (Chantry et al. 2021). Overall, the a priori training strategy suffers from a physics–dynamics decoupling problem much like traditional parameterization. This inconsistent behavior of offline-trained DL models underlines the importance of appropriately incorporating the interaction between DL parameterizations and the resolvable physics, the dynamical core, and physical principles.
Differentiable programming is one potential solution for coupling DL parameterizations more tightly to their host dynamical core: by incorporating the gradient of the numerical solver into backpropagation, it enables gradient-based optimization of the DL parameterization and the dynamical core as a whole (Baydin et al. 2018). Um et al. (2020) developed such a framework using TensorFlow for DL training and ΦFlow as a differentiable partial differential equation solver. Kochkov et al. (2021) further integrated the DL models and the numerical method in the JAX framework, which supports automatic differentiation of native Python and NumPy functions and accelerates them on GPUs or TPUs (Bradbury et al. 2018); the latter feature also accelerates the dynamical core. Kochkov et al. (2021) demonstrated that their learned interpolation method, a DL-based numerical integration approach, takes less time than traditional numerical methods while exhibiting the same level of accuracy and numerical stability, suggesting a promising future for hybrid DL and differentiable dynamical cores in NWP.
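To make the idea concrete, the following is a minimal sketch (not the authors' code; the solver and network are toy placeholders) of how JAX differentiates through a hybrid of a numerical solver step and a neural network correction, so that a single gradient call optimizes both together:

```python
# A minimal sketch of differentiable-programming training: gradients flow
# through both the neural network and the numerical solver.
import jax
import jax.numpy as jnp

def solver_step(state, dt=0.01):
    # Placeholder dynamical core: one explicit step of a toy linear ODE.
    return state + dt * (-0.5 * state)

def nn_correction(params, state):
    # Placeholder single-layer "parameterization" acting on the state.
    return jnp.tanh(state @ params["w"] + params["b"])

def loss_fn(params, state, target):
    # Hybrid step: dynamical core followed by the learned correction.
    pred = solver_step(state) + nn_correction(params, state)
    return jnp.mean((pred - target) ** 2)

# jax.grad differentiates through the solver and the network jointly.
grad_fn = jax.grad(loss_fn)

key = jax.random.PRNGKey(0)
params = {"w": 0.1 * jax.random.normal(key, (8, 8)), "b": jnp.zeros(8)}
grads = grad_fn(params, jnp.ones(8), jnp.zeros(8))  # pytree of gradients
```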
However, it remains unclear from Kochkov et al. (2021) whether the advantages of DL-based parameterization can be converted into a significant extension of the effective lead time of NWP. They used the Kolmogorov flow in their experiments, which is highly turbulent and has a limited predictable range. Unlike the intensely turbulent Kolmogorov flow, the real atmosphere is more of a mixture of geostrophic turbulence and waves (Vallis 2017) and exhibits quasi-periodic behaviors such as the Madden–Julian oscillation (Zhang 2005) and the baroclinic annular mode (Thompson and Barnes 2014). We design a new case with better predictability than the Kolmogorov flow but which also contains an instability mechanism that is periodically activated. We investigate whether DL training with a differentiable dynamical core can significantly extend the effective forecast range in this case.
The advantage of using high-resolution simulations for DL training with automatic differentiation is that the data are spatially dense and the time interval between reference states can be kept relatively short, which reduces the difficulty of training. To extend the lead time of effective prediction, naïvely, we would include longer segments of high-resolution simulation data in the training of the DL-based parameterization. However, this becomes computationally expensive when many consecutive steps of the high-resolution benchmark simulation are included. In addition, high-resolution simulations differ from nature and may become unreliable for long-term prediction. With the development of Earth system observation technologies, we have accumulated a wealth of spatiotemporally sparse observational data. Using these data for posttraining enhancement of a DL parameterization model through transfer learning is a potentially valuable idea.
This study focuses on the strategy of developing subgrid-scale turbulence parameterizations with DL models that are coupled with a differentiable dynamical core during training. The first issue we investigate is the performance of the online training approach and strategies for improving DL model training with high-resolution benchmark simulation data. The second question is whether and how a pretrained DL model can be adjusted with limited observational data to achieve better generalizability or further performance gains. Below, we first introduce our idealized case and its predictability in section 2a. DL training data, models, and strategies are introduced in sections 2b and 2c. Results for pretraining and transfer learning are presented in section 3, followed by a summary and discussion in section 4.
2. Experimental design and methods
a. An idealized case with BVE
The barotropic vorticity equation (BVE) describes the conservation of vorticity in a two-dimensional flow. Here, we design a new case with better predictability than the Kolmogorov flow by periodically suppressing the forcing applied to the flow.
1) Governing equations
In the absence of instability, the natural atmosphere behaves like waves and therefore has substantial predictability. However, baroclinic, convective, or other instabilities may occur at particular times and locations, producing quasi-periodic behaviors (Zhang 2005; Thompson and Barnes 2014). Figure 1 shows the vorticity field during one cycle of our forcing. When the forcing is weak, the flow is dominated by larger eddies and may be less dependent on subgrid-scale (SGS) process parameterization. When the forcing is active, the SGS processes may be largely unresolved in the middle shear zone; there are also regimes where eddies are partially resolved, analogous to the gray zones of NWP. This forcing is a better analogy for actual weather forecasting than either a laminar flow or an intensely turbulent flow.
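For reference, a standard forced-dissipative form of the BVE is sketched below; the shear-forcing pattern F and the periodic amplitude A(t) written here are illustrative assumptions, not the exact forms used in this study:

```latex
\frac{\partial \zeta}{\partial t} + J(\psi, \zeta)
  = \nu \nabla^{2}\zeta + A(t)\, F(x, y),
\qquad \nabla^{2}\psi = \zeta,
```

where ζ is the vorticity, ψ is the streamfunction, J is the Jacobian (advection) operator, ν is the viscosity, F is a fixed shear-forcing pattern, and A(t) is a time-periodic amplitude that is periodically suppressed to modulate the instability.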
2) Predictability limit
To further illustrate the different predictability of the Kolmogorov flow and our BVE setting, we designed a predictability limit test based on different simulation configurations. We used pseudospectral spatial differentiation and the exponential time differencing fourth-order Runge–Kutta (ETDRK4) method for time integration (Kassam and Trefethen 2005). In this test, we performed four simulation groups, TRUTH, ALT, HighRes, and LowRes, which are also used later in the machine learning part. The TRUTH group is a direct numerical simulation (DNS) run on a 1024 × 1024 grid over a 2π × 2π domain. The first 200 time units of the TRUTH run were discarded, and we collected one snapshot every 2.5 time units for a total of 100 time slices, which are used as initial conditions for the other simulation groups. The ALT (alternative truth) group also runs on the 1024 × 1024 grid but carries Fourier-space vorticity at each integration step, whereas the TRUTH group carries physical-space vorticity states. The only difference between TRUTH and ALT is roundoff error, so ALT represents an upper limit on the ability to predict TRUTH with a lower-resolution, parameterized simulation. The HighRes group runs on a 256 × 256 grid and stands for high-resolution simulation in NWP: it is accurate for short periods but computationally expensive and inaccurate for long periods. The LowRes group, representing the operational model in NWP, runs fast on a 64 × 64 grid but needs good SGS models to achieve a breakthrough in forecast lead time.
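The pseudospectral evaluation of the nonlinear BVE tendency can be sketched as follows (a minimal sketch, assuming a doubly periodic 2π × 2π domain and standard 2/3-rule dealiasing; the configuration in this study may differ in detail). The linear viscous term is handled exactly by the ETDRK4 exponential factor, so only the advection term is returned here:

```python
# A minimal pseudospectral sketch of the BVE advection tendency -J(psi, zeta),
# which would feed the nonlinear part of an ETDRK4 integrator
# (Kassam and Trefethen 2005).
import jax.numpy as jnp

def bve_nonlinear_tendency(zeta_hat, n):
    """zeta_hat: (n, n) Fourier coefficients of vorticity on a 2pi x 2pi grid."""
    k = jnp.fft.fftfreq(n, d=1.0 / n)                 # integer wavenumbers
    kx, ky = jnp.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    inv_k2 = jnp.where(k2 > 0.0, 1.0, 0.0) / jnp.where(k2 > 0.0, k2, 1.0)

    psi_hat = -zeta_hat * inv_k2                      # solve nabla^2 psi = zeta
    u = jnp.real(jnp.fft.ifft2(-1j * ky * psi_hat))   # u = -dpsi/dy
    v = jnp.real(jnp.fft.ifft2(1j * kx * psi_hat))    # v =  dpsi/dx
    zx = jnp.real(jnp.fft.ifft2(1j * kx * zeta_hat))  # dzeta/dx
    zy = jnp.real(jnp.fft.ifft2(1j * ky * zeta_hat))  # dzeta/dy

    adv_hat = jnp.fft.fft2(u * zx + v * zy)           # products in physical space
    dealias = (jnp.abs(kx) < n / 3) & (jnp.abs(ky) < n / 3)  # 2/3 rule
    return -adv_hat * dealias
```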
Figure 2 compares the predictability of the Kolmogorov flow and the BVE with periodic shear forcing through the root-mean-square error (RMSE), R2 (with R being the correlation coefficient), and the instantaneous vorticity field for simulations from one initial condition. The forcing in the Kolmogorov flow follows the setting of Kochkov et al. (2021): it has a wavenumber-4 structure in the y direction and creates a forcing for zonal wind shears that is constant in time. The flow is so turbulent that within about 10 time units, even at the highest accuracy in our study, rounding errors grow into significant differences, making it unsuitable as an analogy for numerical weather forecasting. In contrast, the BVE with periodic forcing has better predictability, with slower error growth and correlation decay. Unresolved or poorly resolved scales change the flow over a longer period than in the Kolmogorov flow.
Figure 3 shows the average error growth and R2 curves, as well as their standard deviations, for the 100 ALT, HighRes, and LowRes simulations of the periodic shear forcing setup. We arbitrarily set the threshold for a forecast to remain valid as R2 greater than 0.5. The effective forecast lead time of the LowRes models is around 11 time units (∼1 forcing cycle), which we aim to improve through DL methods using the HighRes simulations, whose effective forecast lead time is around 21 time units (∼2 forcing cycles). The upper limit of predictability for our BVE problem is about 37 time units based on the ALT simulations. These clearly separated effective lead times make it easy to quantify the benefit of DL-based parameterizations for forecasting performance.
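For concreteness, the effective forecast lead time can be computed from an R2-versus-lead-time series as follows (a minimal NumPy sketch; the array layout is an assumption):

```python
# Effective forecast lead time: the last lead time before R^2 first drops
# below a threshold (0.5 by default, matching the definition above).
import numpy as np

def effective_lead_time(r2_series, times, threshold=0.5):
    """r2_series: R^2 vs. TRUTH at each lead time; times: matching lead times."""
    below = np.nonzero(np.asarray(r2_series) < threshold)[0]
    if below.size == 0:
        return times[-1]            # never loses validity within the window
    if below[0] == 0:
        return 0.0                  # invalid from the start
    return times[below[0] - 1]      # last time still at or above threshold
```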
b. SGS processes
c. Model-consistent learning
1) Pretraining
We represent the SGS processes with a convolutional neural network (CNN) that is trained while coupled to the differentiable dynamical core.
We generate a TRUTH run of 20 000 time units. The training set contains data from 200 to 1000 time units for the pretraining, which equals 80 cycles of the forcing. We pick one snapshot every 2.5 time units and apply a local average filter to obtain an initial condition for the HighRes simulation at a resolution of 256 × 256. We run the HighRes simulation from each initial condition for only 5 time units; this ensures the HighRes simulations accurately approximate TRUTH, with an average R2 larger than 0.95 according to the predictability limit test. The training set is then constructed as collections of n consecutive snapshots of the HighRes vorticity field coarse grained to 64 × 64, as sketched below. In this experiment, we perform the training for n = 8, 16, 24, and 32 to see whether including more time steps in a training sample series improves performance. The case n = 1 is also tested to assess the performance of a CNN trained without interacting with the dynamical core.
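A minimal sketch of this dataset construction, assuming the local average filter is a block mean and that snapshots are stored as square arrays (both assumptions for illustration):

```python
# Coarse grain fields to the 64 x 64 LowRes grid with a block-mean
# (local average) filter, then group n consecutive snapshots per sample.
import numpy as np

def coarse_grain(field, factor):
    """Block-average an (N, N) field by an integer factor."""
    n = field.shape[0] // factor
    return field.reshape(n, factor, n, factor).mean(axis=(1, 3))

def make_samples(snapshots, n_steps):
    """Stack n_steps consecutive coarse-grained snapshots per training sample."""
    coarse = np.stack([coarse_grain(s, s.shape[0] // 64) for s in snapshots])
    return np.stack([coarse[i:i + n_steps]
                     for i in range(len(coarse) - n_steps + 1)])
```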
2) Transfer learning
3. Results
We select a TRUTH snapshot every 2.5 time units for the period from 15 000 to 19 500 time units, which is 1400 forcing cycles later than the training dataset, and the resulting collection of snapshots is used as initial conditions for a series of LowRes, HighRes, and LowRes with Smagorinsky SGS (LowRes_Smagorinsky) simulations. To evaluate the performance of the DL-based SGS models, we conducted online tests, in which the trained neural networks are coupled with the dynamical core and the accuracy of the predicted vorticity is evaluated.
a. Pretraining
Here, we first report model performance over 15 000–19 500 time units, the whole test period. Figure 4 shows the averaged coefficient of determination (R2) and RMSE curves for forecasts using the BVE model with the dynamical core coupled to the pretrained SGS models. The Smagorinsky SGS model performs worse in our BVE configuration than LowRes, which uses no SGS model. CNN1, which includes only one look-ahead step in its training, outperforms LowRes in the first time unit, but then its error grows and R2 drops rapidly, giving forecasts significantly worse than LowRes. Somewhat surprisingly, CNN1 performs significantly worse even than the simulation using the Smagorinsky scheme (LowRes_Smagorinsky) after 1 time unit of lead time. This result suggests that the a priori learning approach generalizes poorly for our problem.
Every other DL model, each with more than one look-ahead step, provides notable improvements in error growth and R2, which become progressively better as the number of look-ahead steps increases. The performance differences associated with the number of look-ahead steps are most discernible before approximately 20 time units of lead time, after which all of these models lose validity in their predictions. The improvements are quantified in Table 1. The effective forecast lead time refers to the forecast period during which R2 ≥ 0.5 with respect to TRUTH. The percentages of forecast lead time extension and RMSE reduction are evaluated relative to the LowRes simulation's effective forecast lead time, which is 11 time units. The percentage of reducible RMSE that is actually reduced is the ratio of the RMSE reduction to the difference between the LowRes and HighRes RMSEs at the time LowRes becomes invalid. We call this RMSE "reducible" because the pretraining dataset comes from HighRes; it is therefore reasonable to take the HighRes RMSE as the lower bound on the RMSE of simulations using our pretrained SGS model.
Table 1. Quantitative evaluation of online tests of the models. The RMSE is evaluated when LowRes becomes invalid (R2 < 0.5; approximately at a lead time of 11 time units), and the reducible RMSE is the difference between the LowRes and HighRes RMSEs at that lead time, i.e., reducible RMSE = RMSE_LowRes|t=11 − RMSE_HighRes|t=11. CNN16_TL is a fine-tuned CNN16. Boldface indicates the best result among the pretrained models.
We first compare the models trained with different numbers of look-ahead steps in the loss Lmse. As Table 1 shows, including 8 look-ahead steps in training a CNN yields a 22.73% extension of effective forecast lead time and a 38.21% reduction of RMSE relative to LowRes. These improvements increase to 50% and 80.15%, respectively, with 32 look-ahead steps. Considering that the test set is 1400 forcing cycles later in time than the training set, which covers only 80 cycles, the CNN models trained with the auto-differentiable dynamical core exhibit substantial generalizability in our BVE case. Moreover, increasing the number of look-ahead time steps in the training phase significantly enhances the model's forecasting ability.
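A minimal sketch of this look-ahead training loss follows; step_lowres (one dynamical-core step) and cnn_sgs (the CNN SGS correction) are hypothetical stand-ins, and taking exactly one core step per reference snapshot is a simplifying assumption:

```python
# Roll the hybrid model forward n steps and accumulate MSE against the
# coarse-grained HighRes reference trajectory; gradients flow back through
# all n core steps and CNN evaluations.
import jax
import jax.numpy as jnp

def rollout_loss(params, zeta0, reference, step_lowres, cnn_sgs):
    """reference: (n, 64, 64) coarse-grained HighRes trajectory."""
    def body(zeta, target):
        zeta = step_lowres(zeta) + cnn_sgs(params, zeta)  # core + SGS correction
        return zeta, jnp.mean((zeta - target) ** 2)

    _, step_losses = jax.lax.scan(body, zeta0, reference)
    return jnp.mean(step_losses)   # accumulated MSE over n look-ahead steps

grad_fn = jax.value_and_grad(rollout_loss)
```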
CNN16_Cov, which has the same CNN architecture as the others but introduces a covariance matrix to normalize the loss function, remarkably outperforms CNN16; both are trained with 16 look-ahead steps. CNN16_Cov achieves almost the same effective forecast lead time and RMSE as CNN32. Changing the R2 threshold does not change these conclusions; see appendix B for the sensitivity test. It is also worth noting that the training of CNN16_Cov converges after 20 epochs, whereas CNN32 converges after around 100 epochs. These improvements are statistically significant; detailed hypothesis tests are discussed in appendix C.
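As an illustration only, one plausible form of such a covariance normalization is a Mahalanobis-type weighting of the error by an inverse covariance matrix estimated from training data; the exact formulation used for CNN16_Cov may differ:

```python
# One plausible covariance-normalized loss (an assumption, not necessarily
# the formulation used for CNN16_Cov): weight the flattened error vector by
# a precomputed inverse covariance matrix.
import jax.numpy as jnp

def cov_normalized_mse(error, cov_inv):
    """error: prediction error field; cov_inv: inverse covariance matrix."""
    e = error.reshape(-1)
    return e @ cov_inv @ e / e.size   # Mahalanobis-type normalized squared error
```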
It is natural to ask whether the improvements brought by the DL models are preserved when the test set is farther away from the training set in time. For a good parameterization, we hope the improvements are time invariant. We therefore divide the test period into nine successive segments and report the time-averaged R2 and RMSE series in Fig. 5. The curves show no increasing trend in RMSE or decreasing trend in R2: the online test performance is not influenced by how far in time the test is conducted relative to the training set. This insensitivity indicates that the DL parameterization trained with the auto-differentiable dynamical core is time invariant in our BVE case.
The time-averaged enstrophy spectrum can be used to evaluate the statistical performance of the models and is shown in Fig. 6. The LowRes simulation cannot sufficiently resolve high-wavenumber vorticity and underpredicts enstrophy at medium and large wavenumbers. The enstrophy spectrum of LowRes_Smagorinsky differs significantly from those of the other models at both large and small wavenumbers. All DL models trained with the auto-differentiable dynamical core simulate the smallest scales of the vorticity field better than LowRes. However, including more look-ahead steps does not improve performance here. CNN32 gives the least satisfactory simulation of the smallest scales among the DL models, contrasting with its largest effective forecast lead-time extension and RMSE reduction. CNN8 gives the best small-scale enstrophy distribution but cannot sufficiently represent large-scale structures. CNN16_Cov slightly improves the smallest-scale simulation compared with CNN16 and CNN24. One possible reason is that the neural networks were optimized only on accumulated mean square error, without any penalty on the spectral distribution of enstrophy. Another possible reason is the operator used for coarse graining: according to Ross et al. (2023), online spectral results depend on the filtering and coarse-graining choices, so the design of the training dataset is also critical.
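A minimal sketch of computing such a time-averaged enstrophy spectrum by binning spectral enstrophy into integer wavenumber shells (the normalization convention is an assumption):

```python
# Time-averaged enstrophy spectrum: bin 0.5 * |zeta_hat|^2 by wavenumber shell.
import numpy as np

def enstrophy_spectrum(zeta_fields):
    """zeta_fields: (t, n, n) vorticity snapshots; returns shell-summed spectrum."""
    t, n, _ = zeta_fields.shape
    zeta_hat = np.fft.fft2(zeta_fields) / n**2
    k = np.fft.fftfreq(n, d=1.0 / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    shells = np.round(np.sqrt(kx**2 + ky**2)).astype(int)
    ens = 0.5 * np.abs(zeta_hat) ** 2          # enstrophy density per mode
    spectrum = np.zeros(shells.max() + 1)
    np.add.at(spectrum, shells.ravel(), ens.mean(axis=0).ravel())
    return spectrum                             # index = wavenumber shell
```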
Overall, the CNN turbulence models trained with more look-ahead steps in Lmse exhibit longer extensions of effective forecast lead time and larger forecast error reductions. These improvements are time invariant, suggesting the generalizability of our SGS model training approach. However, including more look-ahead steps does not improve the simulation of the smallest-scale dynamics, and the reason remains unclear. Normalizing Lmse by covariance matrices improves model performance and shortens training time. There are also other reasons why CNN16_Cov is preferable to CNN32, which we discuss in section 3c.
b. Transfer learning
We first tried training a randomly initialized CNN using the temporally sparse data, but the training hardly converged. In the forward pass of the first training step, the randomly initialized CNN adds noise after every integration step of the BVE model, so the error can grow considerably before the first MSE calculation, which occurs only after 40 steps of dynamical-core integration and SGS model "corrections." This makes training from scratch on temporally sparse data difficult and underscores the importance of pretraining on "continuous" high-resolution data.
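The setup just described can be sketched as follows (a minimal sketch; step_lowres, cnn_sgs, and the 40-step observation gap as a fixed constant are hypothetical stand-ins for illustration):

```python
# Transfer-learning loss against temporally sparse references: the hybrid
# model runs freely and is penalized only at steps where a snapshot exists.
import jax
import jax.numpy as jnp

def sparse_obs_loss(params, zeta0, obs, step_lowres, cnn_sgs, gap=40):
    """obs: (m, 64, 64) sparse snapshots, one per `gap` integration steps."""
    loss, zeta = 0.0, zeta0
    for target in obs:
        for _ in range(gap):                    # free-running hybrid model
            zeta = step_lowres(zeta) + cnn_sgs(params, zeta)
        loss += jnp.mean((zeta - target) ** 2)  # MSE only where data exist
    return loss / len(obs)

# Fine-tune starting from the pretrained weights, not a random initialization.
grad_fn = jax.grad(sparse_obs_loss)
```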
Figure 7 shows the transfer-learning results using the pretrained CNN models. Using the same neural network structure and taking the weights of the pretrained CNN16 as initialization, the fine-tuned CNN16 achieves substantial improvements. In forecast RMSE, the fine-tuned CNN16 outperforms HighRes for up to 11 time units with very small standard deviation. Its effective forecast lead time is 19 time units, a 72.73% extension relative to LowRes. Compared with the best-performing pretrained model, CNN16_Cov, CNN16_TL extends the forecast lead time by a further 22.73% and reduces the reducible RMSE by a further 25.41%, although its R2 is still not as good as that of HighRes.
Even though the same training strategy and network structure are used, we observe that the choice of pretrained model to fine-tune is critical; how to make a good choice remains to be determined. Better pretraining R2 and RMSE do not guarantee a better fine-tuned network: the performances of CNN8, CNN24, and CNN32 after transfer learning improve only slightly, unlike that of CNN16. The geometry of the loss function used in our transfer learning, an MSE accumulated over tens of integration steps, may be very complicated, so it is not surprising that different initializations end up at different local minima. Because the prediction is sensitive to tiny differences, the online-test results of the fine-tuned networks can be sensitive to the choice of pretrained initial weights for transfer learning.
Figure 8 illustrates one instance of 30-time-unit predictions by the different simulations, starting from a TRUTH initial condition at t = 16 525 time units. At 6 time units, every model still forecasts the positions of the vortices well. There are some small-scale numerical oscillations in the LowRes simulation, which are removed by our DL-based SGS models. After 15 time units, LowRes cannot predict the correct positions of the two vortices that should be near the left edge, whereas the models using the DL-based turbulence models predict them with good accuracy. After 18 time units, CNN16 and CNN16_Cov lose their ability to simulate most of the vortices, which can still be predicted by CNN16_TL. After 30 time units, the CNN16_TL simulation reproduces vortices that even the high-resolution model fails to capture, as highlighted in Fig. 8. This instance indicates that the fine-tuned model can outperform the HighRes simulation in some cases, although, on average, the forecast produced by CNN16_TL has lower R2 than HighRes after 30 time units.
c. Computational cost
Recall that n is the number of look-ahead steps. For pretraining, computing the loss and its gradient requires n iterations through the neural network and the dynamical core, so the training can be time-consuming. Table 2 reports the wall time required to train a batch of 8 samples for the same CNN with different numbers of look-ahead steps. The test is conducted on a single NVIDIA RTX 3090 with 80% memory preallocation. The training time per batch is linear in n. Using a covariance-matrix-normalized loss function increases the training time slightly because of the additional matrix multiplication. CNN16_Cov takes half the per-batch training time of CNN32, yet the two perform almost identically, and CNN16_Cov gives a better enstrophy spectrum.
Table 2. The wall time (s) required to train a batch of 8 samples.
Another issue to consider is the additional computation time incurred by the DL-based SGS model coupled with the dynamical core at forecast time. Table 3 shows the time required for the different computations in a forecast from one initial condition. Inference of the DL model takes 36.38% of the time of the whole simulation. The additional cost of using a DL-based SGS model is therefore not negligible, but it is affordable and within a reasonable range considering the complexity of turbulence. As a reference, radiation is usually the most computationally expensive part of an atmospheric model; our experience with the latest version of Cloud Model 1, release 21.0 (CM1; Bryan and Fritsch 2002), is that the one-step computation time of radiation, using the RRTMG scheme (Iacono et al. 2008), is approximately 11 times the computation time of CM1's dynamical core (mainly advection). Most modelers choose to calculate radiation only once every several model steps to save overall computational cost; a similar strategy could reduce the computation time of DL-based SGS models in applications.
Table 3. The wall time (ms) required for a 10-time-unit (1 cycle) forecast with the HighRes dynamical core, the LowRes dynamical core, and the DL parameterization.
4. Summary and discussion
In this study, we demonstrated a potential strategy for enhancing the prediction skill of NWP models using an idealized BVE case. The strategy starts with pretraining a DL-based SGS process model on high-resolution simulation data and then fine-tunes the pretrained DL model with transfer learning on observational data. The pretraining incorporates an auto-differentiable dynamical core and yields a considerable extension of forecast lead time and reduction in RMSE. The DL-based SGS model generalizes well, as the improvement is time invariant. Using a covariance matrix to normalize the loss function further improves performance at a lower computational cost.
The pretrained SGS model is then fine-tuned with the auto-differentiable dynamical core and temporally sparse data through transfer learning, which further enhances the models' forecast ability. Applied to an operational forecasting model, this strategy would let us train a DL model once on a relatively large high-resolution dataset and keep updating it as observational data accumulate. Our study shows that this strategy holds promise for substantially improving NWP forecast skill. As observational data are always sparse and noisy, performing transfer learning from such data would be an interesting future topic.
However, some challenges remain. The first is the representativeness of our idealized study case, which is a two-dimensional flow. Previous studies (Kochkov et al. 2021; Frezat et al. 2022) also applied differentiable physics-based training to two-dimensional flows. A necessary follow-up research direction is to investigate this strategy in three-dimensional flows, where the computational and dynamical complexity is significantly higher; the number of look-ahead steps may therefore be limited, and the choice of DL models will need to be adjusted accordingly.
Another potential issue concerns the model generalizability implied by our study. The statement that our model is "time invariant" is based on a case with only one temporally periodic forcing. The real atmosphere and climate system possess forcing variability at many time scales, from diurnal cycles to decadal oscillations, requiring more data to encompass all of these variabilities in pretraining. Most likely, however, we cannot have sufficient samples to represent low-frequency climate variability or global warming–induced changes; DL-based SGS models may therefore encounter "out of sample" features that hinder their performance. This issue highlights the importance of updating DL models through the transfer learning investigated in our study. A next-generation NWP model should probably keep learning from a continuous stream of observational data to adjust its behavior, which involves a relatively new field called lifelong, or continual, learning (Parisi et al. 2019).
We observed some inconsistencies in our study. In pretraining, RMSE and R2 improved, but the enstrophy spectrum did not. Also, pretrained weights with better online-test RMSE and R2 may not perform better when used as the initial weights for transfer learning. We believe one cause of these inconsistencies lies in how we optimize and evaluate our model: conventional loss functions and metrics may not be sufficient. It would be worthwhile to explore loss functions and metrics that can train and identify deep learning models with better consistency.
The approach proposed in this study for developing DL-based SGS turbulence models requires the dynamical core of interest to be automatically differentiable; the availability of an auto-differentiable NWP dynamical core is thus essential. Several projects have noted the potential of differentiable programming for Earth system modeling, such as the Differentiable Programming in Julia for Earth System Modeling (DJ4Earth) project (Edelman et al. 2021). We believe that deep learning–based turbulence models combined with an automatically differentiable dynamical core will bring us a significant step forward in advancing modern NWP and climate models, which will become capable of exploiting high-resolution simulations, observations, and deep learning together (Schneider et al. 2017).
Acknowledgments.
The authors acknowledge the funding support from the Research Grants Council of Hong Kong SAR, China (HKUST-16301721), and the Center for Ocean Research in Hong Kong and Macau (CORE), a joint research center between the Qingdao Pilot National Laboratory for Marine Science and Technology (QNLM) and the Hong Kong University of Science and Technology (HKUST).
Data availability statement.
The code for the dynamical core and deep learning supporting the findings of this study is available online (https://github.com/YONGQUAN-QU/BVEX).
APPENDIX A
Neural Network Architectures and Hyperparameters
All of the layers and operations that compose the CNN are shown in Fig. A1. Following the basic ConvNet structure used by Kochkov et al. (2021), we set the kernel size to 3 × 3. For each convolutional layer, we set the interwindow stride, input dilation, and kernel dilation to 1, which makes the width and height of each layer's output feature map the same as those of its input. Furthermore, we did not include any pooling layer, so the spatial dimensions of the input and output of the CNN are identical. The input of the CNN is a batch of vorticity fields on the 64 × 64 grid.
We did not include batch normalization or dropout. The activation function for every convolution layer except the last is the Gaussian error linear unit (GELU). Compared with the rectified linear unit (ReLU), GELU introduces increased curvature and nonmonotonicity, which potentially makes it easier to approximate complicated functions (Hendrycks and Gimpel 2016).
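A minimal sketch of such a network in flax.linen follows; the channel width and depth are assumptions for illustration, not the exact configuration of Fig. A1:

```python
# ConvNet sketch: 3 x 3 kernels, unit strides, SAME padding (spatial
# dimensions preserved), GELU on all layers but the last, no pooling.
import jax
import jax.numpy as jnp
import flax.linen as nn

class SGSConvNet(nn.Module):
    hidden: int = 64        # assumed channel width
    depth: int = 6          # assumed number of convolutional layers

    @nn.compact
    def __call__(self, x):  # x: (batch, 64, 64, 1) vorticity fields
        for _ in range(self.depth - 1):
            x = nn.Conv(self.hidden, kernel_size=(3, 3),
                        strides=(1, 1), padding="SAME")(x)
            x = nn.gelu(x)
        # Final layer is linear with one output channel (the SGS tendency).
        return nn.Conv(1, kernel_size=(3, 3), strides=(1, 1),
                       padding="SAME")(x)

params = SGSConvNet().init(jax.random.PRNGKey(0), jnp.zeros((1, 64, 64, 1)))
```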
APPENDIX B
Sensitivity Test
We conducted a sensitivity test to see how the percentage of forecast lead time extension changes as the R2 threshold is raised. The tested range of the R2 threshold is 0.5–0.9. As shown in Fig. B1, changing the threshold does not change the relative order of how well the individual models perform.
APPENDIX C
Test of Significance
Because the distributions of R2 are not normal, we applied the nonparametric Mann–Whitney U rank test to the distributions of R2 of different models at a given forecast lead time t; there are 200 samples of R2 for each model at each lead time. The null hypothesis is that two distributions of R2 are identical. The test results are shown in Fig. C1. The effective lead times of the models compared in Table 1 lie in the intersection of the significant time intervals of all the one-sided Mann–Whitney U rank tests, so the improvements in R2, and therefore the effective forecast lead time extensions, brought by the online-trained models and the fine-tuned model in Table 1 are statistically significant with p value < 0.05. A two-sided Mann–Whitney U rank test, colored pink in Fig. C1, shows that most of the time we cannot reject the null hypothesis that the distributions of R2 of CNN16_TL and CNN32 are identical. This statistical significance makes us more confident in giving a positive answer to the question posed in the paper's title.
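The one-sided test described above can be sketched with scipy.stats.mannwhitneyu (a minimal sketch; the significance level of 0.05 matches the text):

```python
# One-sided Mann-Whitney U rank test on the 200 R^2 samples of two models
# at a single forecast lead time.
from scipy.stats import mannwhitneyu

def r2_improvement_significant(r2_model, r2_baseline, alpha=0.05):
    """Test H0: identical distributions vs. H1: model R^2 tends to be larger."""
    stat, p = mannwhitneyu(r2_model, r2_baseline, alternative="greater")
    return p < alpha, p
```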
REFERENCES
Bauer, P., A. Thorpe, and G. Brunet, 2015: The quiet revolution of numerical weather prediction. Nature, 525, 47–55, https://doi.org/10.1038/nature14956.
Baydin, A. G., B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, 2018: Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res., 18, 1–43.
Beucler, T., M. Pritchard, S. Rasp, J. Ott, P. Baldi, and P. Gentine, 2021: Enforcing analytic constraints in neural networks emulating physical systems. Phys. Rev. Lett., 126, 098302, https://doi.org/10.1103/PhysRevLett.126.098302.
Bradbury, J., and Coauthors, 2018: JAX: Composable transformations of Python+NumPy programs, version 0.2.5. Accessed 4 July 2022, http://github.com/google/jax.
Brenowitz, N. D., B. Henn, J. McGibbon, S. K. Clark, A. Kwa, W. A. Perkins, O. Watt-Meyer, and C. S. Bretherton, 2020: Machine learning climate model dynamics: Offline versus online performance. arXiv, 2011.03081v1, https://doi.org/10.48550/arXiv.2011.03081.
Bryan, G. H., and J. M. Fritsch, 2002: A benchmark simulation for moist nonhydrostatic numerical models. Mon. Wea. Rev., 130, 2917–2928, https://doi.org/10.1175/1520-0493(2002)130<2917:ABSFMN>2.0.CO;2.
Chantry, M., S. Hatfield, P. Dueben, I. Polichtchouk, and T. Palmer, 2021: Machine learning emulation of gravity wave drag in numerical weather forecasting. J. Adv. Model. Earth Syst., 13, e2021MS002477, https://doi.org/10.1029/2021MS002477.
Chow, F. K., C. Schär, N. Ban, K. A. Lundquist, L. Schlemmer, and X. Shi, 2019: Crossing multiple gray zones in the transition from mesoscale to microscale simulation over complex terrain. Atmosphere, 10, 274, https://doi.org/10.3390/atmos10050274.
Chung, D., and G. Matheou, 2014: Large-eddy simulation of stratified turbulence. Part I: A vortex-based subgrid-scale model. J. Atmos. Sci., 71, 1863–1879, https://doi.org/10.1175/JAS-D-13-0126.1.
Donahue, A. S., and P. M. Caldwell, 2018: Impact of physics parameterization ordering in a global atmosphere model. J. Adv. Model. Earth Syst., 10, 481–499, https://doi.org/10.1002/2017MS001067.
Durran, D. R., 2010: Numerical Methods for Fluid Dynamics: With Applications to Geophysics. Texts in Applied Mathematics, Vol. 32, Springer Science & Business Media, 516 pp.
Edelman, A., C. Rackauckas, and C. N. Hill, 2021: Earth system model applications. DJ4Earth, https://dj4earth.github.io/research/applications/.
Espeholt, L., and Coauthors, 2022: Deep learning for twelve hour precipitation forecasts. Nat. Commun., 13, 5145, https://doi.org/10.1038/s41467-022-32483-x.
Frezat, H., J. L. Sommer, R. Fablet, G. Balarac, and R. Lguensat, 2022: A posteriori learning for quasi-geostrophic turbulence parametrization. J. Adv. Model. Earth Syst., 14, e2022MS003124, https://doi.org/10.1029/2022MS003124.
Gentine, P., M. Pritchard, S. Rasp, G. Reinaudi, and G. Yacalis, 2018: Could machine learning break the convection parameterization deadlock? Geophys. Res. Lett., 45, 5742–5751, https://doi.org/10.1029/2018GL078202.
Gross, M., and Coauthors, 2018: Physics–dynamics coupling in weather, climate, and Earth system models: Challenges and recent progress. Mon. Wea. Rev., 146, 3505–3544, https://doi.org/10.1175/MWR-D-17-0345.1.
Hendrycks, D., and K. Gimpel, 2016: Gaussian error linear units (GELUs). arXiv, 1606.08415v4, https://doi.org/10.48550/arXiv.1606.08415.
Iacono, M. J., J. S. Delamere, E. J. Mlawer, M. W. Shephard, S. A. Clough, and W. D. Collins, 2008: Radiative forcing by long-lived greenhouse gases: Calculations with the AER radiative transfer models. J. Geophys. Res., 113, D13103, https://doi.org/10.1029/2008JD009944.
Kassam, A.-K., and L. N. Trefethen, 2005: Fourth-order time-stepping for stiff PDEs. SIAM J. Sci. Comput., 26, 1214–1233, https://doi.org/10.1137/S1064827502410633.
Kochkov, D., J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer, 2021: Machine learning–accelerated computational fluid dynamics. Proc. Natl. Acad. Sci. USA, 118, e2101784118, https://doi.org/10.1073/pnas.2101784118.
Laloyaux, P., M. Bonavita, M. Chrust, and S. Gürol, 2020: Exploring the potential and limitations of weak-constraint 4D-Var. Quart. J. Roy. Meteor. Soc., 146, 4067–4082, https://doi.org/10.1002/qj.3891.
Leufen, L. H., and G. Schädler, 2019: Calculating the turbulent fluxes in the atmospheric surface layer with neural networks. Geosci. Model Dev., 12, 2033–2047, https://doi.org/10.5194/gmd-12-2033-2019.
Parisi, G. I., R. Kemker, J. L. Part, C. Kanan, and S. Wermter, 2019: Continual lifelong learning with neural networks: A review. Neural Networks, 113, 54–71, https://doi.org/10.1016/j.neunet.2019.01.012.
Ross, A. S., Z. Li, P. Perezhogin, C. Fernandez-Granda, and L. Zanna, 2023: Benchmarking of machine learning ocean subgrid parameterizations in an idealized model. J. Adv. Model. Earth Syst., 15, e2022MS003258, https://doi.org/10.1029/2022MS003258.
Schneider, T., S. Lan, A. Stuart, and J. Teixeira, 2017: Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophys. Res. Lett., 44, 12 396–12 417, https://doi.org/10.1002/2017GL076101.
Smagorinsky, J., 1963: General circulation experiments with the primitive equations: I. The basic experiment. Mon. Wea. Rev., 91, 99–164, https://doi.org/10.1175/1520-0493(1963)091<0099:GCEWTP>2.3.CO;2.
Tabeart, J. M., S. L. Dance, A. S. Lawless, S. Migliorini, N. K. Nichols, F. Smith, and J. A. Waller, 2020: The impact of using reconditioned correlated observation-error covariance matrices in the Met Office 1D-Var system. Quart. J. Roy. Meteor. Soc., 146, 1372–1390, https://doi.org/10.1002/qj.3741.
Thompson, D. W. J., and E. A. Barnes, 2014: Periodic variability in the large-scale Southern Hemisphere atmospheric circulation. Science, 343, 641–645, https://doi.org/10.1126/science.1247660.
Um, K., R. Brand, Y. R. Fei, P. Holl, and N. Thuerey, 2020: Solver-in-the-loop: Learning from differentiable physics to interact with iterative PDE-solvers. Proc. 34th Int. Conf. on Neural Information Proc. Systems, Vancouver, BC, Canada, Curran Associates Inc., 6111–6122.
Vallis, G. K., 2017: Atmospheric and Oceanic Fluid Dynamics: Fundamentals and Large-Scale Circulation. 2nd ed. Cambridge University Press, 946 pp.
Weyn, J. A., D. R. Durran, and R. Caruana, 2019: Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential height from historical weather data. J. Adv. Model. Earth Syst., 11, 2680–2693, https://doi.org/10.1029/2019MS001705.
Zhang, C., 2005: Madden–Julian Oscillation. Rev. Geophys., 43, RG2003, https://doi.org/10.1029/2004RG000158.