1 Introduction
Recently, detailed numerical simulations of highly nonlinear partial differential equations representing multiphysics problems became possible due to the increased power and memory of modern computers. Nevertheless, detailed simulations remain far too expensive to be used in various engineering tasks including design optimization, uncertainty quantification, and real-time decision support. For example, Bayesian calibration of subsurface reservoirs might involve millions of numerical simulations to account for the heterogeneities in the permeability fields (Elsheikh et al., 2012, 2013). Model Order Reduction (MOR) provides a solution to this problem by learning a computationally cheap model from a set of detailed simulation runs. These reduced models are then used to replace the high-fidelity models in optimization and statistical inference tasks. MOR methods can be broadly categorized into three classes: simplified physics-based models, data-fit black-box models (surrogate models) (Petvipusit et al., 2014), and projection-based reduced order models, commonly referred to as ROM (Frangos et al., 2010).
Physics-based reduced order models are derived from high-fidelity models using approaches such as simplifying physics assumptions, using coarse grids, and/or upscaling of the model parameters. Data-fit models are generated by regressing the high-fidelity simulation data from the input to the output (Frangos et al., 2010; Petvipusit et al., 2014). In projection-based ROM, the governing equations of the system are projected onto a low-dimensional subspace spanned by a small number of basis functions, commonly via Galerkin projection. All projection-based ROM methods assume that the main solution characteristics can be efficiently represented by a linear combination of only a small number of basis functions. Under this assumption, it is possible to accurately capture the input-output relationship of a large-scale full-order model (FOM) using a reduced system with significantly fewer degrees of freedom (Lassila et al., 2014; Berkooz et al., 1993). Different methods can be used to construct the projection bases, including Proper Orthogonal Decomposition (POD), Krylov subspace methods, and methods based on truncated balanced realization (Lassila et al., 2014; Rewienski and White, 2003). ROM based on POD has been widely used to model nonlinear systems (Frangos et al., 2010; Rewienski and White, 2003). Despite this success, a number of outstanding issues limit the applicability of POD as an effective reduced order modeling technique.
One issue is the cost of evaluating the projected nonlinear function and the corresponding Jacobian matrix in every Newton iteration. These costs create a computational bottleneck that reduces the performance of the resulting reduced order models. Existing approaches for alleviating this bottleneck include the gappy POD technique, sparse sampling, Missing Point Estimation (MPE), the Best Point Interpolation Method (BPIM), the Empirical Interpolation Method (EIM), and the Discrete Empirical Interpolation Method (DEIM) (Willcox, 2006; Barrault et al., 2004; Chaturantabut and Sorensen, 2010). All of these methods rely on interpolation schemes that select discrete spatial points for producing an interpolated approximation of the nonlinear functions. They were developed specifically to remove the computational complexity due to the nonlinear function in the PDE system after spatial discretization. Another issue is the convergence and stability of the extracted ROM. Although POD-based methods decrease the calculation times by orders of magnitude by reducing the dimension of the state variables, this reduction goes hand in hand with a loss of accuracy. This may result not only in inaccurate results, but also in slow convergence and, in some cases, model instabilities. Slow convergence means that many iterations are needed to reach the final solution, which increases the computational time. Divergence is even less desirable as it produces invalid simulation results.
Artificial Neural Networks (ANN) have found growing success in many machine learning applications such as computer vision, speech recognition, and machine translation (Hermans and Schrauwen, 2013; He et al., 2015; Hinton et al., 2012; Graves, 2013). Further, ANNs offer a promising direction for the development of innovative model reduction strategies. Neural network use in the domain of MOR is generally limited to constructing surrogate models that emulate the input-output relationship of the system based on the available simulation and experimental data (Koziel and Leifsson, 2013; Snoek et al., 2015). Neural networks have also been combined with POD to generate reduced order models without any knowledge of the governing dynamical systems (Winter and Breitsamter, 2014). One reason for developing such non-intrusive reduced order modeling methods is to address the main issues of the POD-Galerkin projection ROM technique, namely stability and efficient nonlinearity reduction. Recently, the Recurrent Neural Network (RNN), a class of artificial neural network where connections between units form a directed cycle, has been successfully applied to various sequence modeling tasks such as automatic speech recognition and system identification of time series data (Hermans and Schrauwen, 2013; He et al., 2015; Hinton et al., 2012; Graves, 2013). RNNs have been used to emulate the evolution of dynamical systems in a number of applications (Zimmermann et al., 2012; Bailer-Jones et al., 1998) and hence have large potential for building surrogate models and reduced order models of nonlinear dynamical systems. The standard approach of modeling dynamical systems using an RNN relies on three steps: (a) generating training samples from a number of detailed numerical simulations, (b) defining a suitable RNN structure to represent the system evolution, and (c) fitting the RNN parameters to the training data. This purely data-driven approach is very general and can be effectively tuned to capture any nonlinear discrete dynamical system. However, the accuracy of this approach depends on the number of training samples (obtained by running a computationally expensive model) and on the selected RNN architecture. In addition, generic architectures might require a large number of parameters to fit the training data, which increases the computational cost of the RNN calibration process.
Many types of recurrent neural network architectures have been proposed for modeling time-dependent phenomena (Zimmermann et al., 2012; Bailer-Jones et al., 1998). Among those is the Error Correction Neural Network (ECNN) (Zimmermann et al., 2012), which utilizes the misfit between the model output and the true output, termed the model error, to construct the RNN architecture. The ECNN architecture augments the standard RNN architecture by adding a correction factor based on the model error. This correction factor is activated only during training: ECNN takes the time series of the reference output as an input to the RNN for a certain length of the time period, and beyond that period (i.e. in future time steps) the fitted model forecasts the output without the reference output as input.
In the current paper, we propose a physics-aware RNN architecture that captures the underlying mathematical structure of the dynamical system under consideration. We further extend this architecture into a deep residual RNN (DR-RNN) inspired by iterative line search methods (Bertsekas, 1999; Tieleman and Hinton, 2012), which iteratively find the minimiser of a nonlinear objective function. The developed DR-RNN is trained to find the residual minimiser of numerically discretized ODEs or PDEs. We note that the concept of depth in the proposed DR-RNN differs from the view commonly adopted in standard deep neural network architectures of hierarchically transforming the abstract input to fit the desired output (Pascanu et al., 2013a, b). The proposed DR-RNN method reduces the computational complexity from O(N^3) to O(N^2) for fully coupled nonlinear systems of size N, and from O(N^2) to O(N) for sparse nonlinear systems obtained from discretizing time-dependent partial differential equations.
We further combine DR-RNN with projection-based ROM ideas (e.g. POD and DEIM (Chaturantabut and Sorensen, 2010)) to produce an efficient explicit nonlinear model reduction technique with superior convergence and stability properties. Combining DR-RNN with POD/DEIM results in a further reduction of the computational complexity from O(N) to O(r), where r is the size of the reduced order model.
The rest of this paper is organized as follows: Section 2.1 describes dimension reduction via the POD-Galerkin method, followed by a discussion of DEIM in Section 2.2. In Section 3, we present a brief background overview of deep neural networks (feedforward and recurrent); we then introduce the proposed DR-RNN in Section 4. In Section 5, we evaluate the proposed DR-RNN on a number of test cases. Finally, in Section 6 the conclusions of this manuscript are presented.
2 Background for Model Reduction
In this section, we first define the class of dynamical systems considered in this study. Following that, we present a general framework for reduced order modeling based on projecting the original state space onto a low-dimensional, reduced-order space. We also discuss the computational bottleneck associated with dimensionality reduction for general nonlinear systems. Finally, we present the DEIM algorithm, which reduces the time complexity of evaluating the nonlinear terms.
2.1 POD-Galerkin
We consider a general system of nonlinear differential equations of the form:

(1)  dy(t)/dt = A y(t) + F(y(t)),

where y(t) ∈ R^N is the state variable at time t and μ is a system parameter vector (entering through the operators, the forcing, and/or the initial conditions). The linear part of the dynamical system is given by the matrix A ∈ R^{N×N}, and the vector F(y) is the nonlinear term. The nonlinear function F is evaluated componentwise at the components of the state variable y. The complete space of y is spanned by a set of N orthonormal basis vectors. Since y is assumed to be attracted to a certain low-dimensional subspace Y ⊂ R^N, all the solutions of Eq. 1 can be expressed in terms of only n basis vectors (n ≪ N) that span Y. The solution can then be approximated as a linear combination of these basis vectors:

(2)  y ≈ U c(t) + e,

where e is the residual representing the part of y that is orthogonal to the subspace Y. Thus, the inner product of e with any of the basis vectors u_i that span Y is zero (i.e. u_i^T e = 0). The basis vectors of Y are collected in the matrix U ∈ R^{N×n}, and c(t) ∈ R^n is the time-dependent coefficient vector. POD identifies the subspace Y from the singular value decomposition (SVD) of a series of temporal snapshots of the full order system (Eq. 1) collected in the snapshot matrix S, assembled from a number of different simulation runs (i.e. different initial conditions, different controls and/or different model parameters). The SVD of S is computed as:

(3)  S = W Σ V^T.
The orthonormal basis matrix U for approximating y is given by the first n columns of the matrix W. Substituting Eq. 2 into Eq. 1 while neglecting e, one gets:

(4)  U dc/dt = A U c + F(U c).

By multiplying Eq. 4 with U^T, one obtains the POD-based ROM defined by:

(5)  dc/dt = Ã c + U^T F(U c),

where Ã = U^T A U ∈ R^{n×n}. We note that the POD-Galerkin ROM (Eq. 5) is of reduced dimension n and can be used to approximate the solution of the high-dimensional full order model (Eq. 1). The computation of the first term on the right hand side of Eq. 5 involves O(n^2) operations, in comparison to O(N^2) multiplications in the FOM. However, the nonlinear term U^T F(U c) cannot be simplified to O(n) nonlinear evaluations, since F must still be evaluated at all N components of U c. In the following subsection, we review the Discrete Empirical Interpolation Method (DEIM), which aims at approximating the nonlinear term in Eq. 5 using only m ≪ N evaluations, thus rendering the solution procedure of the reduced order system independent of the high-dimensional system size N.
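As a concrete illustration of the steps above, the POD basis and the Galerkin-projected right hand side of Eq. 5 can be sketched in a few lines of NumPy. This is a minimal sketch; the helper names `pod_basis` and `galerkin_rom` are illustrative, not part of any library.

```python
import numpy as np

def pod_basis(S, n):
    """POD basis (Eq. 3): first n left singular vectors of the N x m snapshot matrix S."""
    W, _, _ = np.linalg.svd(S, full_matrices=False)
    return W[:, :n]

def galerkin_rom(A, F, U):
    """Project dy/dt = A y + F(y) onto span(U): dc/dt = At c + U^T F(U c)  (Eq. 5)."""
    At = U.T @ A @ U                               # n x n reduced linear operator
    return lambda c: At @ c + U.T @ F(U @ c)       # reduced right hand side
```

The reduced right hand side can then be passed to any time integrator; note that `F(U @ c)` still requires N componentwise evaluations of the nonlinearity, which is exactly the bottleneck DEIM addresses next.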
2.2 DEIM
As outlined in the previous section, evaluating the nonlinear term in the POD-Galerkin method is still an expensive computational step, as inner products with the full high-dimensional system are needed. The DEIM algorithm reduces the complexity of evaluating the nonlinear terms in the POD-based ROM (Eq. 5) by computing the nonlinear term only at carefully selected locations and interpolating everywhere else. The nonlinear term in Eq. 1 is approximated in a subspace spanned by an additional set of orthonormal basis vectors. More specifically, a low-rank representation of the nonlinearity is computed using the singular value decomposition of a snapshot matrix of the nonlinear function, resulting in:

(6)  S_F = W_F Σ_F V_F^T,

where S_F is the snapshot matrix of the nonlinear function evaluated at the sample solutions taken directly from the snapshot solution matrix S defined in the previous section. The m-dimensional basis for optimally approximating F is given by the first m columns of the matrix W_F, denoted by U_F ∈ R^{N×m}. The nonlinearity vector F is then approximated as:

(7)  F ≈ U_F d(t),
where d(t) ∈ R^m plays a role similar to c(t) in Eq. 2. The idea behind DEIM is to make an optimal selection of m rows of U_F such that the original overdetermined system in Eq. 7 is replaced by an invertible system with an error as small as possible. The selection procedure described in (Chaturantabut and Sorensen, 2010) determines a boolean matrix P ∈ R^{N×m} by making use of the orthonormal basis vectors in U_F; the columns of P are specific columns of the N-dimensional identity matrix (Chaturantabut and Sorensen, 2010). Using P, the basis interpolation of Eq. 7 can be made invertible and thus solvable for d:

(8)  d = (P^T U_F)^{-1} P^T F.

Using this expression for d, the approximate nonlinear term in Eq. 7 is formulated as:

(9)  F ≈ U_F (P^T U_F)^{-1} P^T F = D P^T F,

where D = U_F (P^T U_F)^{-1} is referred to as the DEIM matrix. Due to the selection by P^T, only m components of the right-hand side are needed. In addition, for nonlinear dynamical systems, implicit time integration schemes are often used. This leads to a system of nonlinear equations that must be solved at each time step, for example using Newton's method. At each iteration, besides the nonlinear term, the Jacobian of the nonlinear term must also be computed, with a computational cost depending on the full order dimension N during the evaluation of the reduced Jacobian matrix defined by:

(10)  J_c(c) = U^T J_F(U c) U.
Similar to the approximation of F by the DEIM method, the approximation of the reduced Jacobian of the nonlinear term using DEIM takes the form (Chaturantabut and Sorensen, 2010):

(11)  J_c(c) ≈ U^T D P^T J_F(U c) U.

In summary, by augmenting the standard POD formulation with DEIM, we derive the POD-DEIM reduced order model of the form:

(12)  dc/dt = Ã c + U^T D F(P^T U c),

where the componentwise nature of F allows it to be evaluated at only the m interpolation points selected by P.
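The DEIM point selection of (Chaturantabut and Sorensen, 2010) is a greedy loop over the basis columns: each new interpolation point is placed where the current interpolation residual is largest in magnitude. A minimal sketch, with illustrative helper names, might look like:

```python
import numpy as np

def deim_indices(UF):
    """Greedy DEIM point selection on the nonlinear-term basis UF (N x m)."""
    N, m = UF.shape
    p = [int(np.argmax(np.abs(UF[:, 0])))]
    for l in range(1, m):
        # Interpolate the l-th basis column using the points selected so far:
        c = np.linalg.solve(UF[p, :l], UF[p, l])    # (P^T U_F) c = P^T u_l
        r = UF[:, l] - UF[:, :l] @ c                # interpolation residual
        p.append(int(np.argmax(np.abs(r))))         # next point: largest residual
    return np.array(p)

def deim_matrix(UF, p):
    """DEIM matrix D = U_F (P^T U_F)^{-1} of Eq. 9; then F is approximated by D F[p]."""
    return UF @ np.linalg.inv(UF[p, :])
```

By construction the residual vanishes at previously selected points, so the indices are distinct and P^T U_F stays invertible; any nonlinearity lying exactly in span(U_F) is reproduced without error from its m sampled components.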
3 Review of standard RNN architectures
In this section, we briefly present the basic architecture of deep neural networks. Following that, we review standard architectures of recurrent neural networks and discuss their ability to approximate any dynamical system, as supported by the universal approximation theorem. Then, we discuss the difficulties of training RNNs due to the vanishing gradient problem. Finally, we introduce the Long Short-Term Memory (LSTM) architecture as a standard method to overcome the vanishing gradient problem in RNNs.
3.1 Deep Feedforward Neural Network
An Artificial Neural Network (ANN) is a machine learning method that expresses the input-output relationship in the form:

(13)  y = W_out φ(W x) + ε,

where x ∈ R^d is the input variable, y is the target (output) variable, ŷ = W_out φ(W x) is the predicted output variable obtained from the ANN, φ is the activation function (the basis function) applied to the transformed input variable, W is the transition weight matrix, W_out is the output weight matrix, and ε is an unknown error due to measurement or modeling errors (Bishop, 2006; Hornik et al., 1989). In the current notation, the bias terms are defined within the weight matrices by augmenting the input variable with a unit value (Hermans and Schrauwen, 2013). In Eq. 13, the target variable is modeled as a linear combination of basis functions of the same type (e.g. sigmoid or perceptron basis functions) parametrized by W. A deep ANN of depth L layers is a neural network architecture of the form:

(14)  ŷ = W_out φ_L(W_L φ_{L-1}( ⋯ φ_1(W_1 x))),

where φ_k and W_k are the element-wise nonlinear function and the weight matrix for the k-th layer, and W_out is the output weight matrix.
3.2 Standard Recurrent Neural Network
A Recurrent Neural Network (RNN) is a neural network that has at least one feedback connection in addition to the feedforward connections (Pascanu et al., 2013b). The standard form of RNN is a discrete dynamical system of the form (Pascanu et al., 2013a):

(15)  h_t = φ(W h_{t-1} + U_in x_t),

(16)  ŷ_t = W_out h_t,

where x_t is the input vector at time t, φ is the activation function as defined for the deep ANN, and W, U_in and W_out are, respectively, the transition, input and output weight matrices of the standard RNN. In Eq. 15, the hidden state h_t is estimated based on the corresponding input x_t and the hidden state at the previous time step. This delayed input (h_{t-1}) can be thought of as a memory for the artificial system modelled by the RNN. The order of the dynamical system expressed by the RNN is the number of hidden units, i.e. the size of the hidden state vector (Hermans and Schrauwen, 2013). An RNN can approximate the state variables of any nonlinear difference equation as a linear combination of the hidden state as in Eq. 16, as supported by the universal approximation theorem:
Theorem 3.1 (Universal Approximation Theorem).
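For concreteness, the recurrence of Eqs. 15-16 amounts to a couple of matrix-vector products per time step. The sketch below uses illustrative names; `U_in` denotes the input weight matrix of Eq. 15.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U_in, phi=np.tanh):
    """Eq. 15: hidden-state update h_t = phi(W h_{t-1} + U_in x_t)."""
    return phi(W @ h_prev + U_in @ x_t)

def rnn_output(h_t, W_out):
    """Eq. 16: linear readout y_t = W_out h_t."""
    return W_out @ h_t
```

Rolling `rnn_step` forward over a sequence of inputs produces the hidden-state trajectory whose linear combination (Eq. 16) approximates the state of the modeled difference equation.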
Similar to other supervised learning methods, ANNs and RNNs are calibrated using training data to find the optimal parameters (neuron weights). Given a set of S training sequences {(x_t^s, y_t^s) : t = 1, …, T, s = 1, …, S}, the RNN parameters θ are fitted by minimizing the function:

(17)  L(θ) = (1/S) Σ_{s=1}^{S} (1/T) Σ_{t=1}^{T} ||y_t^s − ŷ_t^s||²,

where L, known as the mean square error (mse), is the average distance between the observed data and the RNN output across the S samples with T time-dependent observations (Pascanu et al., 2013a). The set of parameters θ can be estimated by backpropagating the gradient of the loss function L with respect to θ in time. This technique is commonly called Backpropagation Through Time (BPTT) (Werbos, 1990; Rumelhart et al., 1988; Pascanu et al., 2013b; Mikolov et al., 2014). Similar to deep neural network architectures, the standard RNN has training difficulties, especially in the presence of long-term dependencies, due to the vanishing and exploding gradient problems (Pascanu et al., 2013b; Mikolov et al., 2014). The main reason for the vanishing gradient problem is the exponential dependency of the error function gradient on the weight parameters, combined with the repeated multiplication of gradient terms caused by the cyclic behaviour of the RNN during BPTT. This repeated multiplication causes the gradient to vanish when the absolute values of the weight parameters are less than one (Pascanu et al., 2013b; Mikolov et al., 2014).
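The vanishing-gradient argument above can be checked numerically: BPTT multiplies the per-step recurrent Jacobian diag(φ′)·W across time steps, and with small recurrent weights the norm of the product collapses geometrically. The sizes and weight scales below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)   # small recurrent weights
grad = np.eye(n)                                     # accumulated Jacobian product
norms = []
for t in range(50):                                  # 50 steps of BPTT
    a = rng.standard_normal(n)                       # surrogate pre-activations
    J = np.diag(1.0 - np.tanh(a) ** 2) @ W           # step Jacobian: diag(phi') @ W
    grad = J @ grad                                  # repeated multiplication
    norms.append(np.linalg.norm(grad))
```

Since each factor has norm well below one, the accumulated gradient norm shrinks by many orders of magnitude over the 50 steps, which is exactly the vanishing-gradient behaviour that motivates LSTM.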
3.3 Long Short-Term Memory network
The LSTM architecture (Hochreiter and Schmidhuber, 1997) was introduced to address the aforementioned vanishing gradient problem. The architecture of LSTM is of the form:

(18)  i_t = σ(W_i x_t + U_i h_{t-1})
      f_t = σ(W_f x_t + U_f h_{t-1})
      o_t = σ(W_o x_t + U_o h_{t-1})
      C̃_t = tanh(W_c x_t + U_c h_{t-1})
      C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t
      h_t = o_t ∘ tanh(C_t)

where i_t, f_t and o_t are the input, forget and output gates respectively, with sigmoid activation functions σ. These activation functions take the same inputs, namely x_t and h_{t-1}, but utilize different weight matrices, as denoted by the different subscripts. As the names imply, i_t, f_t and o_t act as gates that channelize the flow of information in the hidden layer. For example, the output gate o_t channelizes the flow into the hidden state through the multiplication of o_t with the transformed memory value (Lipton et al., 2015; De Mulder et al., 2015). The input gate i_t and the forget gate f_t decide the proportion of the candidate memory C̃_t and the proportion of the previous internal memory C_{t-1}, respectively, used to update C_t. Finally, the hidden state h_t is computed by the activation of the output gate channelizing the flow of the internal memory C_t. If the LSTM has more than one hidden unit, the operator ∘ in Eq. 18 is an element-wise multiplication operator.
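A single LSTM step can be sketched as follows, assuming the standard gate formulation reconstructed in Eq. 18 with the four gate weight matrices stacked into two arrays (`Wx`, `Wh` are illustrative names; biases are folded into the inputs as described in Section 3.1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x_t, Wx, Wh):
    """One LSTM step (Eq. 18). Wx: (4m, d) and Wh: (4m, m) stack the
    input/forget/output gate and candidate-memory weights, in that order."""
    m = h_prev.size
    z = Wx @ x_t + Wh @ h_prev
    i = sigmoid(z[:m])              # input gate
    f = sigmoid(z[m:2*m])           # forget gate
    o = sigmoid(z[2*m:3*m])         # output gate
    C_tilde = np.tanh(z[3*m:])      # candidate internal memory
    C = f * C_prev + i * C_tilde    # element-wise gating of the memory
    h = o * np.tanh(C)              # hidden state
    return h, C
```

The additive memory update `C = f * C_prev + i * C_tilde` is the design choice that lets gradients flow across many time steps without the repeated-multiplication decay of the standard RNN.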
4 Physics-driven Deep Residual RNN
General nonlinear dynamical systems (as formulated by Eq. 1) are often discretized using implicit time integration schemes to allow for large time steps exceeding the numerical stability constraints (Pletcher et al., 2012). This leads to a system of nonlinear residual equations that depends on the utilized time stepping method. For example, the residual equation obtained from the implicit Euler time integration scheme at time step t+1 takes the form:

(19)  r_{t+1} = y_{t+1} − y_t − Δt (A y_{t+1} + F(y_{t+1})).

Note that the residual equations of the ROMs (Eq. 5 and Eq. 12) take a form similar to Eq. 19, which is solved at each time step to minimize the residual using Newton's method. In addition, performing parametric uncertainty propagation requires solving a large number of realizations, where each forward realization of the model may involve thousands of time steps and therefore a very large number of nonlinear iterations. To alleviate this computational burden, we introduce a computationally efficient deep RNN architecture, which we denote as the deep residual recurrent neural network (DR-RNN), that reflects the physics of the dynamical system.
The DR-RNN iteratively minimizes the residual equation (Eq. 19) at each time step by stacking K network layers. The architecture of DR-RNN is formulated as:

(20)  ỹ_{t+1}^{(k)} = ỹ_{t+1}^{(k−1)} − w ∘ φ(U r_{t+1}^{(k)})              for k = 1,
      ỹ_{t+1}^{(k)} = ỹ_{t+1}^{(k−1)} − (η_k / √(G_k + ε)) r_{t+1}^{(k)}   for k > 1,

where w, U and η_k are the training parameters of DR-RNN, φ is an activation function (tanh in the current study), the operator ∘ in Eq. 20 denotes an element-wise multiplication operator, r_{t+1}^{(k)} is the residual in layer k obtained by substituting ỹ_{t+1}^{(k−1)} into Eq. 19, and G_k is an exponentially decaying squared norm of the residual defined as:

(21)  G_k = γ ||r_{t+1}^{(k)}||² + ζ G_{k−1},

where γ and ζ are fraction factors and ε is a smoothing term to avoid divisions by zero. In this formulation, we set ỹ_{t+1}^{(0)} = ỹ_t. The DR-RNN output at each time step is defined as:

(22)  ỹ_{t+1} = W_out ỹ_{t+1}^{(K)},

where W_out is a weight matrix that could be optimized during the DR-RNN training process. However, in all our numerical test cases, W_out was excluded from the training process and set as a constant identity matrix. The update equation for ỹ_{t+1}^{(k)} in Eq. 20
is inspired by the rmsprop algorithm (Tieleman and Hinton, 2012), which is a variant of the steepest descent method. In rmsprop, the parameters θ are updated using the following equation:

(23)  θ_i = θ_{i−1} − (η / √(G_i + ε)) ∇L(θ_{i−1}),   G_i = γ G_{i−1} + (1 − γ) (∇L(θ_{i−1}))²,

where G_i is an exponentially decaying average of the squared gradients of the loss function L, γ is the fraction factor (usually γ = 0.9), η is the constant learning rate parameter (usually η = 0.001) and ε is a smoothing term to avoid divisions by zero. We note that G_i in Eq. 23 is a vector that changes both the step length and the direction of the steepest descent update vector. However, G_k in Eq. 20 is a scalar that changes only the step size when updating ỹ_{t+1}^{(k)} in the direction of the residual r_{t+1}^{(k)}. Furthermore, we use G_k as a stability factor in the update, since the update scheme in DR-RNN is explicit in time and may be prone to instability when using large time steps.
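Assuming the implicit-Euler residual of Eq. 19 and the update rules of Eqs. 20-21, one DR-RNN time step can be sketched as below. All parameter values (`gamma`, `zeta`, the untrained `w`, `Uw`, `eta`) are illustrative placeholders, not trained values.

```python
import numpy as np

def drrnn_step(y_t, A, F, dt, w, Uw, eta, gamma=0.9, zeta=0.1, eps=1e-8, K=4):
    """One DR-RNN time step with K residual layers (Eqs. 19-21).
    w, Uw: trainable vector/matrix of layer 1; eta: step sizes for layers k > 1."""
    residual = lambda y: y - y_t - dt * (A @ y + F(y))   # implicit-Euler residual (Eq. 19)
    y = y_t.copy()              # layer 0 output initialized with the previous state
    G = 0.0
    for k in range(1, K + 1):
        r = residual(y)
        G = gamma * np.dot(r, r) + zeta * G              # decaying squared norm (Eq. 21)
        if k == 1:
            y = y - w * np.tanh(Uw @ r)                  # first residual layer
        else:
            y = y - (eta[k - 2] / np.sqrt(G + eps)) * r  # rmsprop-like layers
    return y
```

Each layer reuses the physical residual rather than introducing new hidden units, which is why the parameter count grows only by one scalar per extra layer while the update remains explicit in time.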
One of the main reasons to consider DR-RNN as a low computational budget numerical emulator is the way the time sequence of the state variables is updated. The dynamics of DR-RNN are explicit in time, with a fixed computational budget per time step. Furthermore, the DR-RNN framework offers the prospect of solving Eq. 19 with time steps much larger than those required by the underlying numerical scheme. In other words, DR-RNN provides an effective way to solve Eq. 19 for a fixed time discretization error.
5 Numerical Results
In this section, we demonstrate two different applications of DR-RNN as a model reduction technique. The first application concerns the use of DR-RNN for reducing the computational complexity at each time step for nonlinear ODE systems without reducing the dimension of the state variable of the system. Moreover, the DR-RNN is allowed to take large time steps violating the numerical stability condition, while being constrained to have a time discretization error several times smaller than the order of the large time step taken. We denote this reduction in computational complexity as temporal model reduction. The second application is focused on spatial dimensionality reduction of dynamical systems governed by time-dependent PDEs with parametric uncertainty. In this case, we use DR-RNN to approximate a reduced order model derived using a POD-Galerkin strategy.
In Section 5.1, DR-RNN is first demonstrated for temporal model order reduction, where we also provide a numerical comparison against ROMs based on standard recurrent neural network architectures. In Section 5.2, we build DR-RNN to approximate the POD-based reduced order model (Eq. 5) and compare its performance against the ROMs based on the POD-Galerkin and POD-DEIM methods.
5.1 Temporal model reduction
In this section, we conduct temporal model reduction to evaluate the performance of DR-RNN in comparison to standard recurrent neural networks on three test problems. The standard recurrent neural networks used are denoted by RNN_m and LSTM_m, where the subscript m denotes the order of the recurrent neural network (the number of neurons in the hidden layer). The DR-RNN is denoted by DR-RNN_K, where the subscript K in this case denotes the number of residual layers. We also note that the order of DR-RNN is the same as the order of the given dynamical equation, since we rely on the exact expression of the system dynamics. In all test cases, we utilize a tanh activation function in the standard RNN models.
All the numerical evaluations are performed using the keras framework (Chollet, 2015), a deep learning python package, with the Theano (Theano Development Team, 2016) library as a backend. Further, we train all RNN models using the rmsprop algorithm (Tieleman and Hinton, 2012; Chollet, 2015) as implemented in keras with default settings. We set the output weight matrix of DR-RNN as a constant identity matrix and do not include it in the training process. The vector training parameter w in Eq. 20 is initialized randomly from a zero-mean Gaussian distribution with standard deviation fixed to 0.1. The scalar training parameters η_k in Eq. 20 are initialized randomly from a uniform distribution. The hyperparameters γ and ζ in Eq. 21 are set to fixed fraction values.

Problem 1
We consider a nonlinear dynamical system of order three defined by:

(24)

with prescribed initial conditions in which the input ξ is a random variable with a uniform distribution. Modeling this dynamical system is particularly challenging as the response has jump discontinuities across certain planes of the initial condition space (Bilionis and Zabaras, 2012). Figure 1 shows the jump discontinuities in the response variables versus the perturbations in the initial input ξ. A standard backward Euler method is used for 100 time steps and we solve the problem for 1500 random samples of ξ. We train the RNN parameters using data obtained from 500 random samples, and the remaining 1000 runs are used for validation. The training is performed in mini-batches for a fixed number of iterations. We use 7 recurrent neural networks: two standard RNNs, two LSTMs, and DR-RNNs with 1, 2, and 4 residual layers. The performance of all 7 networks is evaluated based on accuracy and model complexity. Accuracy is measured using the mean square error (Eq. 17) on the training and the test data sets. We also show comparative plots of the probability density function (PDF) of the state variables at specific time steps. Model complexity is determined by the number of parameters fitted in each RNN model.
Figure 2 compares the PDFs of the state variables computed from all 7 RNNs against the reference PDF solution. The results presented in Figure 2 show that the PDFs obtained from the DR-RNN models closely follow the trend of the reference PDF. The mse of all RNN models and the corresponding model complexity are presented in Table 1. It is worth noticing that the DR-RNN models have far fewer parameters and hence much lower model complexity than the standard RNN models. Furthermore, Table 1 shows that DR-RNN with several residual layers is considerably better than the standard RNNs in fitting the data, both on the training and testing data sets. We argue that such performance is due to the iterative update of the DR-RNN output towards the desired output. However, the small differences among the DR-RNN models indicate that additional residual layers are not needed in this particular problem.
Table 1: Model complexity and mse of all RNN models for Problem 1.

Model         RNN (small)  LSTM (small)  RNN (large)  LSTM (large)  DR-RNN_1  DR-RNN_2  DR-RNN_4
# parameters  33           84            1093         4053          3         4         6
mse train     23           15            21           15            2         4         4
mse test      23           15            21           14            5         4         4
We further train the DR-RNN using data sampled at time intervals larger than the step used in the backward Euler numerical integrator; for example, training on every fifth sample results in 20 time samples instead of the original 100. We analyse this experiment using DR-RNN_2 and DR-RNN_4, the top performers in the previous set of numerical experiments. Figure 3 shows the PDF of the state variables computed from DR-RNN_2 and DR-RNN_4 for different time step sizes, along with the PDF computed from the reference solution. As can be observed, the performance of DR-RNN_4 is superior to that of DR-RNN_2, supporting our argument on the hierarchical iterative update of the DR-RNN solution as the number of residual layers increases. In Figure 3, DR-RNN_2 performed well for 2 times the original time step, while it results in large errors for 5 and 10 times, whereas DR-RNN_4 performed well for all the large time steps. Through this numerical experiment, we provide numerical evidence that DR-RNN is numerically stable when approximating the discrete model of the true dynamical system for a range of large time steps with small discretization errors. However, there is a limit on the time step size for a desired accuracy in the DR-RNN output, and this limit is correlated with the number of utilized layers.
Problem 2
The dynamical equation for Problem 2 is the same as in test Problem 1. However, the initial conditions are modified so that the stochastic dimension is increased from 1 to 2. The input random variables are modeled by a uniform probability distribution.
We adopted the same procedure followed in Problem 1 to evaluate the performance of the proposed DR-RNN in comparison to the standard recurrent neural network models. Figure 4 shows a comparison of the PDF plots of the state variables computed from all RNN models. The errors of all RNN models and the corresponding model complexity are presented in Table 2. The performance trends of all RNN models in Problem 2 are similar to those observed in Problem 1.

Table 2: Model complexity and mse of all RNN models for Problem 2.

Model         RNN (small)  LSTM (small)  RNN (large)  LSTM (large)  DR-RNN_1  DR-RNN_2  DR-RNN_4
# parameters  33           84            1093         4053          3         4         6
mse train     26           26            26           20            2         1         2
mse test      26           26            26           20            2         1         3
We follow a similar procedure to that adopted in Problem 1 to analyze the performance of DR-RNN when taking large time steps. Figure 5 compares the PDF of the state variables computed from DR-RNN_2 and DR-RNN_4 for different large time steps with the PDF computed from the reference solution at the fine time step size. We observe performance trends of DR-RNN_2 and DR-RNN_4 similar to those observed in test Problem 1 (Figure 3).
Problem 3
The dynamical system considered in this problem is the same as in Problems 1 and 2, with additional difficulty introduced through the initial conditions: the stochastic dimension is increased to 3. Problem 3 is therefore considerably harder to train than Problem 1. We adopted the same procedure followed in Problem 1 to evaluate the performance of the proposed DR-RNN in comparison to the standard recurrent neural network models. Figure 6 shows the PDFs of the state variables computed from all RNNs. The errors of all RNN models and their model complexity are presented in Table 3. The performance ranking of all 7 RNN models remains similar to Problems 1 and 2 in spite of the increased stochastic dimension. More specifically, from Table 3, we notice a decrease in mse as the number of network layers in DR-RNN increases.
Table 3: Model complexity and mse of all RNN models for Problem 3.

Model         RNN (small)  LSTM (small)  RNN (large)  LSTM (large)  DR-RNN_1  DR-RNN_2  DR-RNN_4
# parameters  33           84            1093         4053          3         4         6
mse train     33           17            33           15            3         1         1
mse test      33           17            33           15            4         5         1
We carry out the same large time step performance analysis as in Problems 1 and 2 for DR-RNN_2 and DR-RNN_4. Figure 7 compares the PDF of the state variables using DR-RNN_2 and DR-RNN_4 for different large time steps with the PDF computed from the reference solution using the fine time step size. The performance trends of DR-RNN_2 and DR-RNN_4 are nearly identical to those noticed in Problems 1 and 2 (Figure 3 and Figure 5). From the results presented in Figure 7, we observe that DR-RNN_4 performs well for large time steps of 2 and 5 times the original step; however, it results in small errors in the PDF plot for the case of 10 times the original step in this problem.
5.2 Dimensionality reduction in space
In this section, we evaluate the performance of the DR-RNN in spatial dimensionality reduction by using the DR-RNN to approximate the ROM derived from a POD-Galerkin strategy. We compare the DR-RNN with the POD-based ROM and the POD-DEIM ROM on two parametric uncertainty quantification problems involving time-dependent partial differential equations.
In the following test cases, the weight matrix of the DR-RNN in Eq. 20 is initialized randomly from a uniform distribution. The vector training parameter in Eq. 20 is initialized randomly from a white Gaussian distribution with its standard deviation fixed to 0.1. The scalar training parameters in Eq. 20 are initialized randomly from a uniform distribution. We fix the hyperparameters in Eq. 21.
Problem 4
In this problem we model unsteady heat diffusion over the spatial domain using:
(25) 
where the unknown is the temperature field and the heat diffusion coefficient is random with a uniform probability distribution. The problem is subject to a homogeneous initial condition and Dirichlet boundary conditions. The heat source takes the form:
(26) 
We use a finite difference discretization with a uniform spatial step size. The discretized FOM is formulated as:
(27) 
with the system matrix obtained using a second-order central difference stencil. The dimension of the discretized problem is set by the number of grid points. The resulting system of ODEs (Eq. 27) is then solved using the standard implicit Euler method with a fixed time step size for 40 time steps. We solve the problem for 500 random samples of the diffusion coefficient. Further, a set of solution snapshots is collected to construct the POD basis by computing the following singular value decomposition
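A minimal sketch of such a discretized FOM advanced by implicit Euler is given below; the grid size, diffusion coefficient, and sine-shaped source used here are placeholders, not the values or source term defined in Eqs. 25-27:

```python
import numpy as np

def heat_fom_implicit_euler(n=100, c=0.5, dt=0.01, n_steps=40):
    """Sketch of a semi-discrete heat FOM, dy/dt = c*A*y + f, as in Eq. 27.

    A is the second-order central-difference Laplacian on a uniform grid
    with homogeneous Dirichlet boundaries; f is a placeholder source.
    """
    dx = 1.0 / (n + 1)
    # Tridiagonal central difference stencil (1, -2, 1) / dx^2
    A = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1)) / dx**2
    x = np.linspace(dx, 1.0 - dx, n)
    f = np.sin(np.pi * x)              # placeholder heat source
    y = np.zeros(n)                    # homogeneous initial condition
    # Implicit Euler: (I - dt*c*A) y^{k+1} = y^k + dt*f
    lhs = np.eye(n) - dt * c * A
    for _ in range(n_steps):
        y = np.linalg.solve(lhs, y + dt * f)
    return y
```

In practice the snapshots collected from repeated runs of such a solver form the columns of the snapshot matrix used in Eq. 28.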
(28) 
where the snapshot matrix collects the sample solutions of Eq. 27 as its columns, and the number of snapshots sets the size of the SVD. The solution space is spanned by the orthonormal column vectors of the left singular matrix. The left panel of Figure 8 shows the decay of the singular values of the snapshot matrix. The optimal basis for approximating the solution is given by the leading columns of the left singular matrix and is used to reduce the FOM given by Eq. 27 to a POD-based ROM of the form:
(29) 
where the reduced operator and source are obtained by Galerkin projection. Next, we solve Eq. 29 using the standard implicit Euler method with a fixed time step size for 40 time steps, using the same 500 random samples used in the FOM (Eq. 27). We solve Eq. 29 for a set of different numbers of POD basis functions. Finally, we build a DR-RNN with four layers to approximate the ROM defined in Eq. 29. We train the DR-RNN using time snapshot solutions of Eq. 29 collected for a number of random samples of the heat diffusion coefficient.
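The POD-Galerkin construction described above (Eqs. 28-29) can be sketched as follows for a generic linear FOM dy/dt = A y + f; the snapshot matrix, operator, and solver settings are placeholders rather than the paper's exact setup:

```python
import numpy as np

def pod_galerkin_rom(snapshots, A, f, k, dt=0.01, n_steps=40):
    """Project a linear FOM dy/dt = A y + f onto its leading k POD modes.

    `snapshots` is the n-by-m snapshot matrix collected from FOM runs;
    the time step, step count, and k are illustrative values.
    """
    # POD basis: left singular vectors of the snapshot matrix (Eq. 28)
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    V = U[:, :k]                        # first k POD basis vectors
    # Galerkin projection (Eq. 29): reduced operator and source
    A_r = V.T @ A @ V
    f_r = V.T @ f
    yr = np.zeros(k)                    # reduced state (homogeneous IC)
    lhs = np.eye(k) - dt * A_r          # implicit Euler in reduced space
    for _ in range(n_steps):
        yr = np.linalg.solve(lhs, yr + dt * f_r)
    return V @ yr                       # lift back to full space
```

When k equals the full state dimension the projection is an orthogonal change of basis, so the ROM reproduces the FOM solution up to roundoff; accuracy for small k is governed by the singular-value decay shown in Figure 8.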
Figures 8 and 9 show the numerical solutions obtained from the FOM, the linear POD-based ROM of dimension 15, and the DR-RNN of dimension 15. The results plotted in these figures show that both the POD-based ROM and the DR-RNN of dimension 15 produce good approximations to the original full-order system.
Figure 10 compares the PDFs obtained from the reduced-order models against the PDF of the full-order model. The ROMs use 5 POD basis functions in the left panel and 15 POD basis functions in the right panel. The results in Figure 10 show that the PDFs obtained by the reduced systems are indistinguishable from the PDF of the FOM when using either 5 or 15 POD basis functions. Figure 11 shows the mse defined in Eq. 17 for different numbers of POD basis functions for the POD-based ROM and the DR-RNN. From Figure 11, we observe that the mse decreases as the number of POD basis functions increases, due to the decay of the singular values of the snapshot solution matrix. Although the results of the DR-RNN and the POD-based ROM are indistinguishable, we note that the DR-RNN is an explicit method, whereas the POD method uses an implicit time discretization scheme whose cost additionally scales with the number of time steps marched in the time domain and the number of random samples.
Problem 5
In this problem, we are interested in modeling fluid displacement within a porous medium, where water is pumped to displace oil. Although the displacing fluid (water) is assumed to be immiscible with the displaced fluid (oil), the displacement does not take place as a piston-like flow process with a sharp interface between the two fluids. Rather, simultaneous flow of the immiscible fluids takes place within the porous medium (Chen et al., 2006). In this problem, we are mainly interested in the evolution of the saturation of the water phase. We solve a pair of partial differential equations, namely the pressure and the saturation equations. A simplified one-dimensional pressure equation takes the form (Chen et al., 2006):
(30) 
where the unknowns are the pressure, the permeability, the total mobility, the mass flow rate, and the phase densities. The subscripts denote the water phase and the oil phase, respectively. The mobility term is defined in terms of the phase viscosities and the relative permeability terms given by the Brooks-Corey model (Chen et al., 2006; Aarnes et al., 2007). The second equation is the saturation equation defined as (Chen et al., 2006):
(31) 
where the remaining quantities are the Darcy velocity field, the porosity, and the fractional flow function. We complete the model description by defining the phase relative permeabilities as a function of saturation using the Brooks-Corey model (Chen et al., 2006):
(32) 
where the parameters are the residual oil saturation, the residual water saturation, and the saturation value. In this simplified model, we assume a constant porosity throughout the medium and we neglect the effects of compressibility, capillarity, and gravity. We complete the description of the problem with the following input data:
(33)  
The initial condition of the saturation is uniform and we use no-flow boundary conditions. We adopt a sequential implicit solution strategy (Chen et al., 2006; Aarnes et al., 2007) to compute the numerical solution of Eq. 30 and Eq. 31. In this method, the velocity field and the saturation are updated sequentially, with each equation treated separately. The first step is to solve for the pressure and the velocity field at an initial time. Then, with this velocity field and the initial saturation, the saturation is evolved over a small number of time steps while the velocity field is kept constant. The resulting saturation is then used to update the pressure and velocity. The process is repeated until the time of interest. We use a simple finite volume method (FVM) for spatial discretization with a first-order upwind scheme, as it is a conservative method. The discretized form of the FOM of Eq. 31 is formulated as:
(34) 
This equation is then discretized in time using the backward Euler method. In space, we use 64 grid points over the domain, and in time we use a fixed time step size for 100 time steps. Newton-Raphson iteration is used to solve the resulting system of nonlinear equations to evolve the saturation at each time step. The uncertain parameter in this test problem is the porosity value, which has a uniform probability distribution. We solve the FOM (Eq. 34) using the standard implicit Euler method for 500 random samples of the porosity. It is interesting to note that the constant time step used here violates the von Neumann stability condition given by
(35) 
where the spatial grid size enters the stability bound (Pletcher et al., 2012; Thomas, 2013; Aarnes et al., 2007). Next, the POD basis vectors are constructed from the solutions of the full-order system taken from the collected set of snapshot solutions. This is done by computing the following singular value decomposition
(36)  
where one snapshot matrix collects the saturation solutions and the other collects evaluations of the nonlinear flow function, with the number of snapshots used in computing the SVD chosen separately for the saturation and the nonlinear flow function. The saturation space is spanned by the orthonormal column vectors of the first left singular matrix, and the space of the nonlinear function is spanned by the orthonormal column vectors of the second. The optimal basis for approximating the saturation is given by its leading columns and is used to reduce the FOM to a POD-based ROM of the form:
(37) 
where the projected nonlinear term forms the bottleneck that has to be reduced with DEIM, as detailed in Section 2.2. Application of the DEIM algorithm (Section 2.2) to the nonlinear POD-based ROM (Eq. 37) results in a POD-DEIM reduced-order model of the form:
(38) 
where the DEIM matrix of Eq. 9 (Section 2.2) is built from the orthogonal matrix that optimally approximates the nonlinear function, given by the leading columns of its left singular matrix. Figure 12 shows the decay of the singular values of the saturation snapshot matrix and of the nonlinear function snapshot matrix. Next, we solve Eq. 37 and Eq. 38 using the standard implicit Euler method with a fixed time step for 100 time steps, using the same 500 random samples of the porosity used in the FOM (Eq. 34). We solve Eq. 37 for a set of numbers of POD basis functions and, similarly, we solve Eq. 38 for the same set of POD basis functions using a fixed number of DEIM basis functions. Further, we build a DR-RNN to approximate the POD-DEIM ROM (Eq. 38), where we apply DEIM within the DR-RNN to evaluate the nonlinearity, which gives an important speedup in the efficiency of the formulation. We train the DR-RNN using time snapshot solutions of Eq. 37 collected for a number of random samples of the porosity.
Figure 13 compares the kernel density estimated probability density function (PDF) obtained from all ROMs to the PDF obtained from the FOM. Figure 14 compares the numerical solutions obtained from all the reduced-order models to the numerical solutions obtained from the FOM. In these figures, the ROMs use 15 POD basis functions in the left panel and 35 POD basis functions in the right panel. When the POD basis of dimension 35 is used, the numerical solutions of the reduced systems from all approaches appear indistinguishable from the numerical solution of the FOM. We note that the saturation equation has a hyperbolic structure which is more difficult to capture, especially in the nonlinear function.
Figure 15 shows the mse defined in Eq. 17 for different numbers of POD basis functions for all ROMs. From Figure 15, we observe a decrease in mse as we increase the number of POD basis functions, which is attributed to the decay of the singular values of the snapshot solution matrix. Although the errors from the POD reduced system are slightly lower than the errors arising from applying POD-DEIM and the DR-RNN, the complexity of the online computation of the nonlinear function in the POD-based ROM still depends on the dimension of the original full-order system. Moreover, it is necessary to compute the full-dimensional Jacobian matrix at every Newton iteration and at every time step in the POD-based ROM (Chaturantabut and Sorensen, 2010). The POD-DEIM approach gives an accurate reduced system with reduced computational complexity by removing the dependency on the dimension of the original full-order system for general nonlinearities; however, POD-DEIM still relies on evaluating the Jacobian matrix at every Newton iteration, so its per-step cost scales with the number of Newton iterations. The presented numerical results showed that both the POD-DEIM and DR-RNN approaches can be used to construct an accurate reduced system. However, the DR-RNN constructs an accurate reduced system without evaluating the Jacobian matrix (as an explicit method), thus replacing the number of Newton iterations in the per-step cost by the number of stacked network layers.
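To make the Jacobian cost discussed above concrete, the following sketch performs one implicit Euler step of a 1-D upwind saturation update solved by Newton-Raphson with a dense finite-difference Jacobian. The fractional flow (quadratic Corey with zero residual saturations), the injection boundary with unit saturation, and the physical parameters are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def saturation_step_newton(s, v=1.0, phi=0.2, dx=1.0 / 64, dt=0.001,
                           tol=1e-10, max_iter=20):
    """One implicit Euler step of a 1-D saturation update (cf. Eq. 34)
    with a first-order upwind flux, solved by Newton-Raphson.

    Uses an assumed quadratic Corey fractional flow with zero residual
    saturations; physical parameters are placeholders.
    """
    def frac_flow(u):
        return u**2 / (u**2 + (1.0 - u)**2)

    def residual(u):
        # Upwind flux (flow left-to-right): f(u_i) - f(u_{i-1}), with an
        # assumed injection ghost cell of unit saturation at the left.
        fu = frac_flow(u)
        flux_in = np.concatenate(([frac_flow(1.0)], fu[:-1]))
        return phi * (u - s) + dt * v * (fu - flux_in) / dx

    u = s.copy()
    for _ in range(max_iter):
        r = residual(u)
        if np.linalg.norm(r) < tol:
            break
        # Dense finite-difference Jacobian, recomputed every iteration;
        # this is exactly the per-step cost the DR-RNN avoids.
        J = np.zeros((u.size, u.size))
        eps = 1e-7
        for j in range(u.size):
            du = np.zeros_like(u)
            du[j] = eps
            J[:, j] = (residual(u + du) - r) / eps
        u = u - np.linalg.solve(J, r)
    return u
```

Each time step requires assembling and factorizing the Jacobian once per Newton iteration, which is the cost factor replaced by the fixed number of stacked layers in the explicit DR-RNN update.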
6 Conclusions
In this paper, we introduce a Deep Residual Recurrent Neural Network (DR-RNN) as an efficient model reduction technique that accounts for the dynamics of the full-order physical system. We then present a new model order reduction approach for nonlinear dynamical systems using the DR-RNN in combination with POD-based MOR techniques. We demonstrate two different applications of the developed DR-RNN for reducing the computational complexity of the full-order system in computational examples involving parametric uncertainty quantification. Our first application concerns the use of the DR-RNN to reduce the computational complexity of nonlinear ODE systems. In this context, we evaluate the DR-RNN in emulating nonlinear dynamical systems using large time step sizes that violate the numerical stability condition, while incurring only small time discretization errors. The presented results show an increased accuracy of the DR-RNN as the number of residual layers increases, based on the hierarchical iterative update scheme.
The second application of the DR-RNN is related to spatial dimensionality reduction of dynamical systems governed by time-dependent PDEs with parametric uncertainty. In this context, we use the DR-RNN to approximate the ROM derived from a POD-Galerkin strategy. For the nonlinear case, we combine POD with the DEIM algorithm for approximating the nonlinear function. The developed DR-RNN provides a significant reduction of the computational complexity of the extracted ROM, replacing the per-time-step dependence on the number of Newton iterations in the nonlinear POD-DEIM method with a dependence on the number of stacked network layers in the DR-RNN.
References
 Aarnes et al. (2007) Jørg E Aarnes, Tore Gimse, and Knut-Andreas Lie. An introduction to the numerics of flow in porous media using MATLAB. In Geometric modelling, numerical simulation, and optimization, pages 265–306. Springer, 2007.
 Bailer-Jones et al. (1998) Coryn A. L. Bailer-Jones, David J. C. MacKay, and Philip J. Withers. A recurrent neural network for modelling dynamical systems. Network: Computation in Neural Systems, 9(4):531–547, 1998.
 Barrault et al. (2004) Maxime Barrault, Yvon Maday, Ngoc Cuong Nguyen, and Anthony T. Patera. An empirical interpolation method: application to efficient reduced-basis discretization of partial differential equations. Comptes Rendus Mathematique, 339(9):667–672, 2004. ISSN 1631-073X. doi: http://dx.doi.org/10.1016/j.crma.2004.08.006.
 Berkooz et al. (1993) Gal Berkooz, Philip Holmes, and John L Lumley. The proper orthogonal decomposition in the analysis of turbulent flows. Annual review of fluid mechanics, 25(1):539–575, 1993.
 Bertsekas (1999) Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 Bilionis and Zabaras (2012) Ilias Bilionis and Nicholas Zabaras. Multioutput local gaussian process regression: Applications to uncertainty quantification. Journal of Computational Physics, 231(17):5718–5746, 2012.
 Bishop (2006) Christopher M Bishop. Pattern recognition. Machine Learning, 2006.
 Bullinaria (2013) John A Bullinaria. Recurrent neural networks. Neural Computation: Lecture, 12, 2013.
 Chaturantabut and Sorensen (2010) Saifon Chaturantabut and Danny C Sorensen. Nonlinear model reduction via discrete empirical interpolation. SIAM Journal on Scientific Computing, 32(5):2737–2764, 2010.
 Chen et al. (2006) Zhangxin Chen, Guanren Huan, and Yuanle Ma. Computational methods for multiphase flows in porous media. SIAM, 2006.
 Chollet (2015) François Chollet. Keras. https://github.com/fchollet/keras, 2015.
 De Mulder et al. (2015) Wim De Mulder, Steven Bethard, and Marie-Francine Moens. A survey on the application of recurrent neural networks to statistical language modeling. Computer Speech & Language, 30(1):61–98, 2015.
 Elsheikh et al. (2012) Ahmed H. Elsheikh, Matthew Jackson, and Tara Laforce. Bayesian reservoir history matching considering model and parameter uncertainties. Mathematical Geosciences, 44(5):515–543, Jul 2012. ISSN 1874-8953. URL https://doi.org/10.1007/s11004-012-9397-2.
 Elsheikh et al. (2013) Ahmed H. Elsheikh, Mary F. Wheeler, and Ibrahim Hoteit. Nested sampling algorithm for subsurface flow model selection, uncertainty quantification, and nonlinear calibration. Water Resources Research, 49(12):8383–8399, 2013. ISSN 19447973. URL http://dx.doi.org/10.1002/2012WR013406.
 Frangos et al. (2010) M. Frangos, Y. Marzouk, K. Willcox, and B. van Bloemen Waanders. Surrogate and Reduced-Order Modeling: A Comparison of Approaches for Large-Scale Statistical Inverse Problems, pages 123–149. John Wiley & Sons, Ltd, 2010. ISBN 9780470685853. doi: 10.1002/9780470685853.ch7. URL http://dx.doi.org/10.1002/9780470685853.ch7.
 Funahashi and Nakamura (1993) Kenichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural networks, 6(6):801–806, 1993.
 Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 Hermans and Schrauwen (2013) Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190–198, 2013.
 Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
 Koziel and Leifsson (2013) Slawomir Koziel and Leifur Leifsson. Surrogatebased modeling and optimization. Applications in Engineering, 2013.
 Lassila et al. (2014) Toni Lassila, Andrea Manzoni, Alfio Quarteroni, and Gianluigi Rozza. Model order reduction in fluid dynamics: challenges and perspectives. In Reduced Order Methods for modeling and computational reduction, pages 235–273. Springer, 2014.
 Lipton et al. (2015) Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
 Mikolov et al. (2014) Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc’Aurelio Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014.
 Pascanu et al. (2013a) Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013a.
 Pascanu et al. (2013b) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013b.
 Petvipusit et al. (2014) Kurt R. Petvipusit, Ahmed H. Elsheikh, Tara C. Laforce, Peter R. King, and Martin J. Blunt. Robust optimisation of CO2 sequestration strategies under geological uncertainty using adaptive sparse grid surrogates. Computational Geosciences, 18(5):763–778, Oct 2014. ISSN 1573-1499. URL https://doi.org/10.1007/s10596-014-9425-z.
 Pletcher et al. (2012) Richard H Pletcher, John C Tannehill, and Dale Anderson. Computational fluid mechanics and heat transfer. CRC Press, 2012.
 Rewienski and White (2003) Michal Rewienski and Jacob White. A trajectory piecewise-linear approach to model order reduction and fast simulation of nonlinear circuits and micromachined devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(2):155–170, 2003.
 Rumelhart et al. (1988) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.
 Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.
 Theano Development Team (2016) Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
 Thomas (2013) James William Thomas. Numerical partial differential equations: conservation laws and elliptic equations, volume 33. Springer Science & Business Media, 2013.
 Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
 Werbos (1990) Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 Willcox (2006) Karen Willcox. Unsteady flow sensing and estimation via the gappy proper orthogonal decomposition. Computers & fluids, 35(2):208–226, 2006.
 Winter and Breitsamter (2014) M. Winter and C. Breitsamter. Reduced-order modeling of unsteady aerodynamic loads using radial basis function neural networks. Deutsche Gesellschaft für Luft- und Raumfahrt, Lilienthal-Oberth e.V., 2014.
 Zimmermann et al. (2012) Hans-Georg Zimmermann, Christoph Tietz, and Ralph Grothmann. Forecasting with recurrent neural networks: 12 tricks. In Neural Networks: Tricks of the Trade, pages 687–707. Springer, 2012.