## Abstract

Accurate and real-time sea surface salinity (SSS) prediction is an elemental part of marine environmental monitoring. It is believed that the intrinsic correlations and patterns of historical SSS data can improve prediction accuracy, but they have not been fully exploited by statistical methods. In recent years, deep-learning methods have been successfully applied to time series prediction and have achieved excellent results by mining the intrinsic correlation of time series data. In this work, we propose a dual path gated recurrent unit (GRU) network (DPG) to address the SSS prediction accuracy challenge. Specifically, DPG uses a convolutional neural network (CNN) to extract the overall long-term pattern of the time series, and a recurrent neural network (RNN) to track the local short-term pattern. The CNN module is composed of a 1D CNN without pooling, and the RNN part is composed of two parallel but different GRU layers. Experiments conducted on the South China Sea SSS dataset from the Reanalysis Dataset of the South China Sea (REDOS) show the feasibility and effectiveness of DPG in predicting SSS values. It achieved accuracies of 99.29%, 98.44%, and 96.85% in predicting the coming 1, 5, and 14 days, respectively. Moreover, DPG achieves better prediction accuracy and stability than autoregressive integrated moving averages, support vector regression, and artificial neural networks. To the best of our knowledge, this is the first time that data intrinsic correlation has been applied to predict SSS values.

## 1. Introduction

Sea surface salinity (SSS) is one of the most critical factors in the study of climate forecasting (Cronin and McPhaden 1999), global water cycle (Batteen et al. 1995), sea ice observation, marine disaster monitoring (Reul et al. 2012; Hasson et al. 2013), marine ecosystems (Gabarró et al. 2004) and military field. Accurate and real-time SSS prediction is an elemental part of marine environmental monitoring. Computational methods for predicting SSS values can be divided into two categories: statistical methods and machine-learning methods.

Statistical methods refer to physical oceanic models with fixed functions and some assumptions, in which the values of involved parameters can be computed with empirical data. For example, multivariate adaptive regression spline model (Urquhart et al. 2012) and multiple linear regression model (Qing et al. 2013) have been used for SSS prediction. These statistical methods have several merits:

- The models are usually simple and explicit to understand.
- They are usually easier and faster to solve than machine-learning methods.

However, because of the nonlinear and stochastic nature of SSS data, statistical methods cannot describe this nature well, and they yield larger prediction errors than machine-learning methods.

Machine-learning methods can "learn" internal patterns or correlations from series data. Machine-learning methods—see, for example, the genetic algorithm (Chen et al. 2017) or the artificial neural network (ANN) (Urquhart et al. 2012)—have been used for SSS prediction. Machine-learning methods have superior expressive ability and can fit nearly any function to arbitrary precision. However, an ANN cannot learn the correlation of sequence data, so time features must be chosen manually, which may lead to unsatisfactory prediction results. Similarly, the genetic algorithm often faces a nonconvex optimization problem and can easily get stuck in a local minimum. Optimizing machine-learning methods may be hard, and overfitting remains a tough problem to solve (Jiang et al. 2018; Siahkoohi et al. 2019).

In recent decades, deep-learning methods, especially the recurrent neural network (RNN), have been widely used for time series data processing and value prediction. RNN introduces the recurrent unit structure and allows internal connections between the hidden units, so it is suitable for analyzing and processing time series data (Krizhevsky et al. 2012). However, RNN suffers from the gradient vanishing problem: as the time interval increases, RNN loses the ability to learn historical information from the past. Hochreiter and Schmidhuber (1997) presented the long short-term memory (LSTM) network, which solves the gradient vanishing problem by introducing "gate control units," and it has been widely used in the field of time series data prediction. However, it usually takes a long time to train an LSTM because of its complex internal structure. To speed up training, Cho et al. (2014) proposed the gated recurrent unit (GRU) network model based on the LSTM network model in 2014, which can maintain the prediction effect with fewer training parameters than LSTM. In 2017, LSTM was applied to predicting sea surface temperature (SST) values and outperformed the classical support vector regression (SVR) method (Zhang et al. 2017; Ratto et al. 2019). However, no result has yet been reported on using LSTM to predict SSS values.

In this work, a GRU-based model is proposed to predict SSS values. Considering the small computation amount and fast convergence of GRU, we apply GRU to SSS prediction and propose the dual path GRU network (DPG), which consists of a CNN module and an RNN module. The CNN part is composed of a 1D CNN without pooling, and the RNN part is composed of two parallel but different GRU layers. This structure is designed to extract both the overall long-term pattern and the local short-term pattern of the time series. Experiments conducted on the SSS data from the Reanalysis Dataset of the South China Sea (REDOS) show the feasibility and effectiveness of DPG in predicting SSS values. It achieved accuracies of 99.29%, 98.44%, and 96.85% in predicting the coming 1, 5, and 14 days, respectively. Moreover, DPG achieves better prediction accuracy and stability than the autoregressive integrated moving average (ARIMA), SVR, and ANN. To the best of our knowledge, this is the first time that data intrinsic correlation is applied to predict SSS values.

## 2. Method

In this section, we first describe SSS prediction as a time series forecasting problem, then present the DPG architecture, and finally introduce the objective function and the optimization strategy.

### a. Problem description

We regard SSS prediction as a curve-fitting problem and aim to use historical SSS values to predict future SSS values. More formally, given a series of historical values *Y* = {*y*_{1}, *y*_{2}, …, *y*_{n}}, where *n* is the number of historical steps, we aim to predict a series of future values in a rolling forecasting fashion, as shown in Fig. 1. We start by using the elements of *Y*_{1} = {*y*_{1}, *y*_{2}, …, *y*_{n}} to predict the value of *y*_{n+1}; then we use *Y*_{2} = {*y*_{2}, *y*_{3}, …, *y*_{n+1}} to predict *y*_{n+2}, and so on. In this way, we can predict the desired future SSS values starting from the very first window *Y*_{1}. Note that each predicted value is based on previous prediction information. We are interested here in prediction within 2 weeks, so we set *n* = 14; that is, the SSS values of the past 2 weeks are used to predict the SSS values of the next 2 weeks.

Figure 2 exhibits the topological structure of the proposed DPG network, which is composed of a CNN module and an RNN module; their functions are elaborated in detail in the following two subsections.

### b. The CNN module

A convolutional neural network (CNN) is a downsampling network that slides convolution filters over the input data. In our DPG network (shown in Fig. 2), we use a 1D CNN without a pooling layer as the first layer, which can extract the local short-term pattern from the long-term time series and reduce the number of parameters. The *j*th filter sweeps through the input vector **V** and produces

$$\mathbf{h}_j = \mathrm{RELU}(\mathbf{W}_j \ast \mathbf{V} + \mathbf{b}_j),$$

where the asterisk denotes the convolution operation and the RELU function is RELU(*x*) = max(0, *x*). The output **h**_{j} is a vector, and the output matrix of the convolutional layer is of size (*n* − size_{f} + 1) × numb_{f}, where *n* denotes the number of steps, size_{f} is the size of the filters, and numb_{f} is the number of filters. After the convolution layer, a dropout layer is added, which can effectively reduce overfitting and achieve a regularization effect to some extent (Srivastava et al. 2014; Lu et al. 2018).
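The output size stated above can be checked with a minimal numpy sketch of a "valid" 1D convolution with ReLU and no pooling (weights are random stand-ins, not the trained filters):

```python
import numpy as np

def conv1d_relu(v, W, b):
    """'Valid' 1D convolution of series v with numb_f filters, no pooling.

    v: input series of length n; W: (numb_f, size_f) filter bank;
    b: (numb_f,) biases. Returns an (n - size_f + 1, numb_f) matrix,
    matching the output size stated in the text.
    """
    n = len(v)
    numb_f, size_f = W.shape
    out = np.empty((n - size_f + 1, numb_f))
    for j in range(numb_f):
        for t in range(n - size_f + 1):
            out[t, j] = W[j] @ v[t:t + size_f] + b[j]
    return np.maximum(out, 0.0)          # RELU(x) = max(0, x)

rng = np.random.default_rng(0)
n, size_f, numb_f = 14, 6, 100           # values used later in the paper
feat = conv1d_relu(rng.normal(size=n),
                   rng.normal(size=(numb_f, size_f)),
                   np.zeros(numb_f))
```

With *n* = 14 and size_{f} = 6, the feature matrix has 14 − 6 + 1 = 9 rows.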

### c. The RNN module

#### 1) Structure of GRU

The GRU network model and the LSTM network model have similar data flows in their cells, but GRU does not have a separate storage unit, which makes it more efficient in training. Figure 3 shows the typical structure of a GRU cell. There are two kinds of gates in GRU: the reset gate *r*_{t} and the update gate *z*_{t}. Both are activated by the logistic sigmoid function. The reset gate *r*_{t} determines how dependent the candidate state *h̃*_{t} is on the history state *h*_{t−1}: a smaller reset gate value means more historical information is ignored in the candidate state. The update gate *z*_{t} determines how much information from the historical state *h*_{t−1} is retained in the current state *h*_{t} and how much is received from the candidate state *h̃*_{t}: a larger update gate value means more candidate state information is received.
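One step of the gating described above can be written out directly in numpy (the weight values here are random stand-ins, only the gate structure matters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, p):
    """One GRU step following the gate definitions in the text.

    p holds weight matrices W_*, U_* and biases b_* for the reset gate r,
    the update gate z, and the candidate state h~.
    """
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Larger z -> more candidate information received, less history kept.
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(1)
d_in, d_h = 4, 8
p = {k: rng.normal(size=(d_h, d_in if k.startswith("W") else d_h)) * 0.1
     for k in ("W_r", "W_z", "W_h", "U_r", "U_z", "U_h")}
p.update({b: np.zeros(d_h) for b in ("b_r", "b_z", "b_h")})
h = gru_cell(rng.normal(size=d_in), np.zeros(d_h), p)
```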

#### 2) Workflow of dual path GRU

The output of the dropout layer goes into two parallel but different GRU layers at the same time. One of them is the general-GRU layer, and the other is the improved-GRU layer; that is why we call this structure dual path. The hidden units of the general-GRU layer are connected sequentially, and its hidden state at time *t* is computed as

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$\tilde{h}_t = \tanh[W_h x_t + U_h (r_t \odot h_{t-1}) + b_h],$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where *W* and *U* are weight matrices, *b* is the bias term, ⊙ is the element-wise product, *σ* is the sigmoid function, tanh is the hyperbolic tangent function, and *x*_{t} is the input of this layer at time *t*. The output of this layer is the hidden state at each time stamp.

In practice, GRU usually fails to capture very long-term correlations because of the gradient vanishing problem. To make our model learn the overall long-term pattern of the time series, we borrow the skip-connection structure from the residual network (ResNet; He et al. 2016) and use skip links in our GRU to construct the improved-GRU (IGRU) layer. The skip-connection structure can reflect periodic patterns in the real world. For example, to predict the temperature at *t* o'clock, a classical trick is to check the records at *t* o'clock on previous days, especially yesterday, that is, 24 h ago. That is exactly what we want our skip-connection structure to learn from the data. The improved-GRU layer has a hyperparameter *s*: the distance spanned by each skip connection is *s* hidden units. The hidden state of the IGRU layer at time *t* can be calculated by

$$r_t = \sigma(W_r x_t + U_r h_{t-s} + b_r),$$
$$z_t = \sigma(W_z x_t + U_z h_{t-s} + b_z),$$
$$\tilde{h}_t = \tanh[W_h x_t + U_h (r_t \odot h_{t-s}) + b_h],$$
$$h_t = (1 - z_t) \odot h_{t-s} + z_t \odot \tilde{h}_t,$$

where the input of this layer is the output of the dropout layer, and *s* is the number of hidden units skipped in the IGRU layer. In our data experiments, we found that a well-tuned *s* can significantly increase the accuracy of the model.
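The effect of the skip connection can be sketched with a simplified recurrence; a plain tanh cell stands in for the full GRU gating, and `s`, `W`, `U` are illustrative:

```python
import numpy as np

def run_skip_rnn(xs, s, W, U, d_h):
    """Recurrence over hidden states spaced s steps apart.

    Each state depends on h[t - s] instead of h[t - 1], so the states at
    t, t + s, t + 2s, ... form their own chain. With s = 7 on daily data,
    each chain links the same weekday across successive weeks.
    """
    h = np.zeros((len(xs), d_h))
    for t, x_t in enumerate(xs):
        h_skip = h[t - s] if t >= s else np.zeros(d_h)
        h[t] = np.tanh(W @ x_t + U @ h_skip)
    return h

rng = np.random.default_rng(2)
d_in, d_h, s = 3, 5, 7
h = run_skip_rnn(rng.normal(size=(20, d_in)), s,
                 rng.normal(size=(d_h, d_in)) * 0.1,
                 rng.normal(size=(d_h, d_h)) * 0.1, d_h)
```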

At last, we use a fully connected layer to combine the outputs of the general-GRU (GGRU) layer and the IGRU layer. The inputs to the fully connected layer are the hidden state of the GGRU layer at time stamp *t*, denoted by *h*_{t}^{G}, and the *s* hidden states of the improved-GRU layer from time stamp *t* − *s* + 1 to *t*, denoted by *h*_{t−s+1}^{I}, *h*_{t−s+2}^{I}, …, *h*_{t}^{I}. The output of the fully connected layer is computed as

$$\tilde{y}_t = W^{G} h_t^{G} + \sum_{i=0}^{s-1} W_i^{I} h_{t-i}^{I} + b,$$

where the *W* terms are weight matrices, *b* is the bias term, and *ỹ*_{t} is the prediction result of the DPG at time stamp *t*.
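A minimal sketch of this combination step (shapes only; the weights are random stand-ins for the learned fully connected layer):

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, s = 50, 7                       # numb_g = 50 nodes, skip length s = 7

h_G = rng.normal(size=d_h)           # GGRU hidden state at time t
h_I = rng.normal(size=(s, d_h))      # last s IGRU states, t - s + 1 .. t

# Fully connected layer: concatenate all inputs, apply one affine map.
inputs = np.concatenate([h_G, h_I.ravel()])   # length (s + 1) * d_h = 400
W = rng.normal(size=(1, inputs.size)) * 0.01
b = np.zeros(1)
y_pred = W @ inputs + b              # scalar SSS prediction for time t
```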

### d. Objective function and optimization strategy

In the optimization process, adaptive moment estimation (Adam) is used to optimize the model parameters. Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent algorithm; it adapts the learning rate per parameter, performing larger updates for infrequent parameters and smaller updates for frequent ones (Kingma and Ba 2015). During training, the algorithm iteratively updates the weights and biases of each neuron in the network so as to minimize the loss function. For the loss function of the model, we use the mean-square error given by

$$L = \frac{1}{n} \sum_{t=1}^{n} (y_t - \tilde{y}_t)^2,$$

where *n* is the number of samples, *y*_{t} is the actual value, and *ỹ*_{t} is the output value of the model at time stamp *t*.
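Adam minimizing this MSE loss can be illustrated on a toy one-parameter problem (the update rule follows Kingma and Ba 2015; the fitting problem itself is invented for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy problem: fit y = 2x by minimizing the MSE loss from the text.
rng = np.random.default_rng(4)
x = rng.normal(size=64)
y = 2.0 * x
theta, m, v = 0.0, 0.0, 0.0
losses = []
for t in range(1, 2001):
    err = theta * x - y
    losses.append(np.mean(err**2))     # L = (1/n) sum (y~_t - y_t)^2
    grad = 2.0 * np.mean(err * x)      # dL/dtheta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```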

## 3. Data experiments

### a. Dataset

We create an SSS dataset covering the South China Sea from the REDOS. The dataset contains daily values from January 1992 to January 2012 (7305 days in total) and covers the South China Sea from 5° to 23°N and from 105° to 123°E on a 0.10° latitude × 0.10° longitude grid (180 × 180).

### b. Evaluation index

Prediction accuracy (ACC) and root-mean-square error (RMSE) are used to evaluate the effectiveness of the different prediction methods; they are calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_{pi} - X_{ai})^2}, \qquad \mathrm{ACC} = 1 - \frac{1}{N} \sum_{i=1}^{N} \left| \frac{X_{pi} - X_{ai}}{X_{ai}} \right|,$$

where *X*_{ai} and *X*_{pi} are the actual and predicted values of point *i*, respectively, and *N* is the number of points. The RMSE reflects the accuracy of the prediction well and is sensitive to very large or very small errors in a set of results; that is, it shows the ability of the model to control absolute error (Phaisangittisagul 2016). For RMSE a lower value is better, whereas for ACC a higher value is better. In the following experiments, we use the area-average RMSE and area-average ACC, and the best results are highlighted in boldface in the tables.
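The two indices can be computed directly; note the ACC form here (one minus the mean relative error) is an assumption consistent with the percentage accuracies reported in this paper, and the sample values are invented:

```python
import numpy as np

def rmse(actual, pred):
    """Root-mean-square error: sensitive to large individual errors."""
    return np.sqrt(np.mean((pred - actual) ** 2))

def acc(actual, pred):
    """Prediction accuracy: 1 minus the mean relative error (assumed form)."""
    return 1.0 - np.mean(np.abs((pred - actual) / actual))

actual = np.array([33.0, 33.5, 34.0, 33.8])   # illustrative SSS values (psu)
pred = np.array([33.1, 33.4, 34.2, 33.7])
r = rmse(actual, pred)
a = acc(actual, pred)
```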

### c. Results and analysis

The data experiments are run under the Ubuntu 16.04 64-bit operating system, using Keras as the framework and TensorFlow as the backend. It is common to partition a dataset by either 6:2:2 or 8:1:1. When the amount of data is small, making the training set too large and the validation set too small may lead to serious overfitting, while a small test set cannot effectively evaluate the model. As the amount of data increases, however, a large validation set is no longer needed; after all, its only purpose is to tune the model toward the optimum, and in most cases a small percentage suffices. For instance, in the million-level ImageNet dataset, 99.8% of the data is used for training and 0.1% each for validation and testing. Therefore, the South China Sea SSS dataset is split into a training set (80%), a validation set (10%), and a test set (10%) in chronological order.
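A chronological 80/10/10 split (no shuffling, so the model is always validated and tested on data later than its training data) can be sketched as:

```python
import numpy as np

def chronological_split(series, train=0.8, val=0.1):
    """Split a time series into train/val/test in chronological order."""
    n = len(series)
    i = int(n * train)
    j = int(n * (train + val))
    return series[:i], series[i:j], series[j:]

sss = np.arange(7305.0)     # stand-in for the 7305 daily SSS values
train, val, test = chronological_split(sss)
```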

We also compare the results of using 8:1:1 and 6:2:2 to split the dataset into training, validation, and test sets. There are six hyperparameters in our DPG model: the number of iterations, the batch size, the number of filters numb_{f}, the size of filters size_{f}, the number of hidden units skipped in the IGRU layer *s*, and the number of nodes in the GRU layers numb_{g}; the same number of nodes is used in both the IGRU and GGRU layers. In the data experiments, we tune these hyperparameters gradually.

We first determine the number of iterations and the batch size, because they set the training scale for the entire network. The number of iterations is chosen from {25, 50, 75}, and the batch size is chosen from {100, 150, 200, 250}. Table 1 shows the results on the SSS dataset with different iteration numbers and batch sizes. It can be seen from the results that the best performance occurs when iteration = 50 and batch = 200.

If the number of iterations is too small, training does not bring the parameters close to their optimal values; if it is too large, obvious overfitting occurs, which reduces the accuracy of the model. Similarly, different batch sizes incorporate different amounts of sample information during training. When the batch size is too small, the model has difficulty converging; when it is too large, the accuracy changes little relative to the optimal value, but the computation and training time increase significantly.

Following the structural order of the model, we next determine the hyperparameters of the convolution layer. The number of filters numb_{f} is chosen from {25, 50, 100}, and the size of filters size_{f} is chosen from {2, 4, 6, 8}. Table 2 shows the results on the SSS dataset with different numbers and sizes of filters. It can be seen from the results that the best performance occurs when numb_{f} = 100 and size_{f} = 6. The reason may be that a sufficiently large number of filters better processes the input information, and a suitable filter size extracts the most appropriate short-term local pattern from the long-term time series.

For the RNN module, the number of hidden units skipped in the IGRU layer, *s*, is chosen from {3, 5, 7, 9}, and the number of nodes in the GRU layers, numb_{g}, is chosen from {25, 50, 75}. Table 3 shows the results on the SSS dataset with different *s* and numb_{g}. It can be seen from the results that the best performance occurs when numb_{g} = 50 and *s* = 7. The reason may be similar: enough nodes to process the input information, and a suitable *s*. With *s* = 7, each skip connection in the IGRU layer spans seven hidden units, which corresponds to one week in real life; that is, our model suggests that the SSS value of the current day is most strongly related to the SSS value of the same day one week before.

To test the prediction ability of our DPG model, we compare it with the traditional time series prediction model ARIMA; with basic machine-learning models such as SVR, ANN, and simple-RNN; and with a newer time series prediction model, the temporal convolutional network (TCN), which is also built from CNN and RNN components, just like our DPG model. All of these methods are tested on our South China Sea SSS dataset, and their performances are shown in Table 4.

For ARIMA, we set *p* = 1, *q* = 1, and *d* = 1. In SVR, we use the radial basis function (RBF) kernel for prediction and perform experiments using the “scikit-learn” software. Moreover, the kernel width for RBF is set as *σ* = 1.2, which is chosen by cross validation. For both ANN and simple-RNN, a three-tier architecture is selected, that is, input layer, hidden layer, and output layer. MSE is taken as the loss function, and Adam is used for optimization. For TCN, we set the number of filters numb_{f} = 32 and the size of kernel size_{k} = 3.
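The SVR baseline setup can be sketched as follows; the RBF kernel width σ maps to scikit-learn's `gamma = 1/(2σ²)`, and the window framing and synthetic series are illustrative, not the paper's data:

```python
import numpy as np
from sklearn.svm import SVR

def make_windows(series, n=14):
    """Frame the series into (past n values -> next value) samples."""
    X = np.array([series[i:i + n] for i in range(len(series) - n)])
    y = series[n:]
    return X, y

rng = np.random.default_rng(5)
series = (33.5 + 0.3 * np.sin(np.arange(400) * 2 * np.pi / 365)
          + 0.02 * rng.normal(size=400))   # synthetic SSS-like series (psu)

X, y = make_windows(series)
sigma = 1.2                                # kernel width chosen in the paper
model = SVR(kernel="rbf", gamma=1.0 / (2 * sigma**2))
model.fit(X[:300], y[:300])                # train on the earlier windows
pred = model.predict(X[300:])              # predict the later ones
```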

We make predictions for 1 day, 5 days, and 2 weeks (14 days) into the future to verify the DPG model in the cases of short-term, midterm, and long-term prediction, respectively. Moreover, we can see from Table 4 that the DPG model achieves the best prediction performance.

To show the prediction results of the DPG model more intuitively, we visualize the 5-day prediction and the ground truth in Fig. 4. It can be seen that there is a high similarity between the predicted values shown in Fig. 4a and the actual values shown in Fig. 4b. Moreover, we use a grayscale image to show the difference between the predicted and actual values in Fig. 4c. We find that the biggest differences are concentrated southwest of Taiwan province; this is because the SSS in these areas changes quite rapidly and is affected by many factors, which makes accurate prediction from the historical data of these regions difficult.

These results are obtained by extracting both the overall long-term pattern and the local short-term pattern of the time series. In DPG, the long-term pattern of the SSS series is learned mainly by the skip-connected IGRU path, while the local short-term pattern is captured by the CNN module together with the GGRU path. Since SSS values do not change much over the year, the slow variation can be tracked by the GRU paths, which learn long-term patterns while forgetting unimportant features, and the CNN module catches the features appearing over short time scales.

From this point of view, it is not surprising that the accuracy reaches 99.29%, 98.44%, and 96.85% in predicting the coming 1, 5, and 14 days, respectively. It remains a challenge, however, to predict further ahead, such as 30 days. Moreover, when the season changes, the SSS value becomes unstable over a very short time and does not easily follow a regular pattern. We need to design methods that learn such very short-term changes from small-scale data.

## 4. Conclusions

In this paper, we proposed a novel deep-learning model for the task of SSS forecasting and explored the optimal parameters of this architecture by experiment. By combining the strengths of CNN and RNN, DPG can learn not only the overall long-term pattern of the time series but also its local short-term pattern. Compared with ARIMA, SVR, TCN, and other models, DPG showed significantly better, state-of-the-art results in time series forecasting on the South China Sea SSS dataset.

For future research, there are several promising directions for extending this work. First, we can improve our model to predict salinity at different ocean depths by modifying hyperparameters such as the number of model layers and nodes, or by introducing densely connected structures (Huang et al. 2017) into the GRU layer. In other words, the task changes from predicting sea surface salinity to predicting upper-ocean salinity, a prediction over a 3D region that combines temporal and spatial information. We can then use the predicted results to infer other variables related to sea salinity, such as freshwater flux and the direction of the ocean current, and to plan future fishery and aquaculture activities.

## Acknowledgments

This work was supported by National Key Research and Development Program (2018YFC1406204; 2018YFC1406201), National Natural Science Foundation of China (Grants 61873280, 61672033, 61672248, and 61972416), Taishan Scholarship (tsqn201812029), Major projects of the National Natural Science Foundation of China (Grant 41890851), Natural Science Foundation of Shandong Province (ZR2019MF012), Fundamental Research Funds for the Central Universities (18CX02152A and 19CX05003A-6), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDA19060503), and the Chinese Academy of Sciences (Grant ISEE2018PY05).


## Footnotes

© 2020 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).