In mathematics and computing, the Levenberg-Marquardt algorithm (LMA) provides a numerical solution to the problem of minimizing a function, generally non-linear, over a space of parameters of the function. These minimization problems arise especially in least squares curve fitting and non-linear programming. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. LMA can also be viewed as Gauss-Newton using a trust region approach. The LMA is a very popular curve-fitting algorithm used in many software applications for solving generic curve-fitting problems. However, the LMA finds only a local minimum, not a global minimum.
The used algorithm is known as the Levenberg-Marquardt algorithm (LMA) that goes back to 1943 as Kenneth Levenberg presented his work to the Mathematical Society at their annual meeting (Levenberg, K. The Quarterly of Applied Mathematics 2, 164 (1944)). Marquardt on the other hand popularized the method by distributing his FORTRAN code for free. In the following section a brief overview is given of the Levenberg-Marquardt algorithm and its implementation in iMODFLOW. Most of the following LMA implementation has been based on the paper of Olsthoorn (1995), Effective Parameter Optimization for Ground-Water Model Calibration, Groundwater, Vol. 33, no.1) and the paper of Knorr BM (2011), The Levenberg-Marquardt Algorithm, Seminar in Modern Physics, Summer 2011 and the PEST Manual of [Doherty(2010)].
The core of parameter estimation is the minimization of some error criterion, cost or objective function \(\Phi (\textbf {p})\), that depends on a parameter vector p with elements \(p_i \rightarrow i=1,N_p\) where \(N_p\) denotes the number of unknowns to be optimized (i.e. the amount of parameters). In general the objective function \(\Phi _{\rm m}(\textbf {p})\) is the sum of squares sum of the individual errors notated as:
\begin{align*} \Phi \left (\textbf {p}\right )=\left [\textbf {y}-\phi \left (\textbf {p}\right )\right ]^{\rm T}\textbf {Q}_h\left [\textbf {y}-\phi \left (\textbf {p}\right )\right ] \end{align*}
where y are the measurements with elements \(y_i \rightarrow i=1,N_h\); where \(N_h\) denotes the number of observations; \(\phi (\textbf {p})\) are the computed head for the parameters defined in p and Q is the weight matrix assigned to the observations defined as:
\begin{align*} \textbf {Q}_h[i,i]=\frac {1}{\sigma ^2_i} \end{align*}
where \(\textbf {Q}_h[i,i]\) is a square matrix with the weight on its main diagonal for each \(i^{\rm th}\) observation. The variance \(\sigma ^2_i\) is the squared standard deviation \(\sigma _i\) that measures the amount of variation from the average. A low \(\sigma _i\) indicates that an particular observation \(y_i\) should be able to the meet the corresponding computed head \(\phi _i\) more closely that observations with higher values of variations \(\sigma \). It is possible to specify weight values \(\textbf {Q}_h[i,i]\) or standard deviation \(\sigma _i\) in iPEST. For the latter iMOD computes weight values as \(\sigma ^{-2}\), internally.
The Levenberg-Marquardt Algorithm (LMA) is applied to minimize the objective function value, in other words it tries to find the location in parameter space where the gradient is zero, or near zero. Mathematically speaking, this point is where the derivative of the objective function towards the parameters is zero- or lowest, thus:
\begin{align*} \min (\Phi ) = \frac {\partial \phi (\textbf {p})}{\partial \textbf {p}} = \textbf {J} \approx 0.0, \end{align*}
where J is so-called the Jacobian matrix. It says, that whenever the sensitivity is 0.0, no improvement can be found of the objective function and the optimal solution has been found. For most of the model, this Jacobian cannot be derived from the model and need to be estimated. This is done by finite-differences where small perturbations of parameter values (\(\Delta \) p) introduce a response in the state-vector (\(\phi \)). These responses are stored in the Jacobian matrix J as:
\begin{align*} \textbf {J} =\frac {\partial \phi (\textbf {p})}{\partial \textbf {p} } \approx \frac {\phi (\textbf {p})-(\phi (\textbf {p}+\Delta \textbf {p})}{\Delta \textbf {p}} \end{align*}
where the Jacobian matrix J has dimensions \(N_h \times N_p\) (number of observations row wise and number of parameters column wise). It represents the sensitivity for each observation point to a small perturbation \(\Delta \textbf {p}\) to the parameters p\(_t\). The Jacobian need to be recomputed for each cycle \(t\) which makes this optimization algorithm inefficient whenever a larger number of parameters is involved. iMOD contains an option to compute those sensitivities in parallel (iPESTP=1).
The minimal achievable value is achieved by adjusting values for the parameter vector \(\textbf {p}\) until the optimal (lowest objective function value) is found. This technique can never guarantee a global minimum, to qualify the uniqueness of the solution the regularisation (adding more constraints) can be applied. In order to arrive at a valid minimal objective function value, it is advisable to let \(N_h\) substantially larger than the number of parameters \(N_p\) (Yeh and Yoon (1981), Aquifer parameter identification with optimum dimension in parameterization, Wat.Res.Res, v17, no. 3, pp. 664-672) or apply regularisation as done by the Pilot Point concept (Doherty, 2003), see subsection 12.33.5. It is also necessary to perform the parameter optimization in log-space. Here fore, parameters values are transformed into \(\log 10\)-space and transformed back to normal space whenever they are inserted in the model.
The Gradient Descent Method (Steepest Descent Method) approaches the minimum of the objective function \(\min (\Phi )\) by adjusting each parameter according to:
\begin{align*} \textbf {p}_{t+1} &=\textbf {p}_t - \zeta _t \textbf {J}_t^{\rm T}\textbf {Q}_h \left [\textbf {y}-\phi \left (\textbf {p}_t\right )\right ], \end{align*}
where u\(_t\) is the update vector for all parameters, the subscript t denotes the sequential parameter iteration (parameter update cycle). It says that for each cycle \(t\), the individual parameter sensitivities (J) multiplied with the current residual (\(\left [\textbf {y}-\phi \left (\textbf {p}_t\right )\right ]\)) contributes to a reduction on the objective function. As the gradient need to go downwards, the "-"-sign is added. The \(\zeta _t \) is a weighting factor for the \(t^{\rm th}\) cycle, that determines how large the modification step for the current set of parameters is. The Gradient Descent Method takes large steps in this space of the residual surface that has small sensitivities and it takes small steps for those areas with large gradients. It normally leads to zigzagging in long narrow valleys on the \(\Phi \) residual surface, which decrease an efficient minimization progress. The Gauss-Newton method replaces the scaling factor \(\zeta _t\) by the inverse of the curvature (often called the HESSIAN) of the sensitivity surface and interchanges the undesired behaviour of the Gradient Descent Method. It therefore converges faster, so:
\begin{align*} \textbf {p}_{t+1} =\textbf {p}_t - \left [ \frac {\partial \phi (\textbf {p})}{\partial \textbf {p}} \right ]_t^{-1} \textbf {J}_t^{\rm T}\textbf {Q}_h \left [\textbf {y}-\phi \left (\textbf {p}_t\right )\right ] \end{align*}
The algorithm assumes linearity over the interval \(\partial \textbf {p}\), the second derivative \(\frac {\partial \phi (\textbf {p})}{\partial \textbf {p}}\) is approximated by \(\textbf {J}^{\rm T}\textbf {Q}_h\textbf {J}\) in the neighbourhood of \(\textbf {p}_t\), thus:
\begin{align*} \textbf {p}_{t+1} =\textbf {p}_t - \left [ \textbf {J}^{\rm T}\textbf {Q}_h\textbf {J} \right ]_t^{-1} \textbf {J}_t^{\rm T}\textbf {Q}_h \left [\textbf {y}-\phi \left (\textbf {p}_t\right )\right ] \end{align*}
In fact the Hessian describes the derivative of the derivative (J) and represents the rate of change of the slope and therefore the curvature of the sensitivity. If parameters are far from their optimal values, which they are probably initially, this \(\textbf {J}^{\rm T}\textbf {QJ}\) is a crude approximation of the true Hessian matrix. As a result the parameter update might be quite wrong which can result in a failure to converge. The great insight of Levenberg was to simply combine the Steepest Descent- and the Gauss-Newton Method to include a damping factor \(\lambda \) (Lagrange Multiplier) which determines how much of the Steepest Descent- or Gauss-Newton Method to include, so:
\begin{align*} \textbf {p}_{t+1} =\textbf {p}_t - \left [ \textbf {J}^{\rm T}\textbf {Q}_h\textbf {J} + \lambda \textbf {I}\right ]_t^{-1} \textbf {J}_t^{\rm T}\textbf {Q}_h \left [\textbf {y}-\phi \left (\textbf {p}_t\right )\right ] \end{align*}
where I is the identity matrix. Whenever \(\lambda _t\) is large, the parameter update will be determined more by the Steepest Descent Method and whenever \(\lambda _t\) is small, the Gauss-Newton Method prevails, significantly. An initial value for \(\lambda _0\) is estimated by:
\begin{align*} \lambda _0 = 10^{\rm Floor \left (\log 10 \frac {\Phi }{2N_h} \right )}. \end{align*}
Enduring the optimization the \(\lambda \) is adjusted, automatically. This is done in two stages:
1. Trust Hyper Sphere
Olsthoorn (1994) suggested to adjust \(\lambda \) such that the yielding parameter update \(\textbf {p}_{t+1}\) is within a certain trust hyper sphere (with a radius equal to a predefined maximal update step size per parameter). All parameters are scaled accordingly to ensure the direction of the gradient;
2. Parameter Boundaries
whenever parameters exceed their upper and or lower bounds they are excluded for the current parameter update cycle and remain at their boundaries. However, each cycle, this is re-examined as parameters might be moving inside the allowed parameter domain during the optimization process;
In the figure below it is demonstrated how the trusted hyper sphere affects the behaviour of convergence to the minimum of the \(\Phi \)-surface. In practice, small trust hyper spheres will yield more iterations than large trust hyper spheres; however, the latter can zigzag from one side of the valley onto the other side.
Whenever a suitable value for \(\lambda _t\) has been found, several additional \(\lambda ^i_t\) are computed as, e.g.
\begin{align*} \textbf {p}_{t+1}(\alpha \lambda _t )~;~\alpha ~\in ~\{0.1,1.0,10.0\}, \end{align*}
in which \(\alpha \) are multiplication factors that are by default equal to 0.1,1.0 and 10.0. However, more can be specified in iPESTP (use the keywords NLINESEARCH and TEST_LAMBDA). The objective function \(\Phi \) is computed for all of these \(\textbf {p}_{t+1}^i\) and the one that has the lowest \(\Phi \) is selected to proceed to the next iteration cycle. This makes the optimization very robust and less sensitive for capturing in steep and narrow valley of the \(\Phi \)-plane, prematurely. In the situation that none of the \(\lambda _t^i\) yield a lowering of the objective function, \(\lambda _t\) is multiplied with a factor \(\gamma =4\) and the \(\lambda \)-testing is carried out again.
Identifiability of the parameters to be optimized is a prerequisite of model calibration. This means that a unique set of parameters values \(\textbf {p}\) yields a unique head \(\phi \) field. In fact, all information for identifiability of the parameters is contained in the Jacobian matrix of the model at the optimum values of the parameters.
\begin{align*} \textbf {J}=\left [\frac {\partial \phi }{\partial p_i},\frac {\partial \phi }{\partial p_{i+1}},\cdots ,\frac {\partial \phi }{\partial p_{n_p}}\right ] \end{align*}
The Jacobian reveal the observability of the parameters by virtue of its rows and columns. Each row expresses the sensitivity of a single observation with respect the set of parameters. Each column expresses the sensitivity of all observations with respect to a single parameter. Some fundamental studies have been carried out which shed light of the inverse problem with respect to the identifiability parameters (Dietrich, 1990, Speed and Ahlfeld, 1996). These are based on the Singular Value Decomposition (SVD for (non-)square matrices) or Eigenvalue Decomposition (square matrix) of the sensitivity matrix:
\begin{align*} \left (\textbf {J}^{\rm T}\textbf {Q}\textbf {J}+\lambda _t\textbf {I}\right ) &= \textbf {V}\Lambda ^{\rm e}\textbf {V}^{-1} \\ \left (\textbf {J}^{\rm T}\textbf {Q}\textbf {J}+\lambda _t\textbf {I}\right )\textbf {V} &= \textbf {V}\Lambda ^{\rm e}. \end{align*}
where \(\Lambda ^{\rm e}\) contains the ordered eigenvalues (singular values are \(\sqrt {\Lambda ^{\rm e}}\)) and \(\textbf {V}\) are the eigenvectors. The eigenvectors representing the principal axis of the sensitivity matrix; the eigenvalues represent a measure for the relative ‘energy’ associated with a corresponding eigenvector. Particularly the eigenvalue of the Hessian describes the size of the curvature. The ratio between the first and last eigenvalues is the condition number \(\kappa =\log 10{\lambda ^{\rm e}_1/\lambda ^{\rm e}_n}\). The size of the condition number determines the shape of the minimum in the residual surface. Whenever the condition number is small, the parameters are identifiable, they all contribute significantly and unique to the residual surface. An informal rule of thumb is that if the condition number \(\kappa =15\), multicollinearity is a concern (multicollinearity or collinearity is a statistical phenomenon in which two or more variables in a multiple regression model are highly correlated); if it is greater than \(\kappa > 30\) multicollinearity is a very serious concern. The relative importance \(\varphi \) of the eigenvalues \(\lambda ^{\rm e}\) is given by:
\begin{align*} \varphi _i = \frac {\lambda _i^{\rm e}}{\sum _{j=1}^{N_p}\lambda _i^{\rm e}}\cdot 100.0\%~;i~\in ~\{1,\cdots ,N_p\}. \end{align*}
Whenever the eigenvalues decomposition leads to relative eigenvalues of \(\varphi =0.0\%\), it means that the sensitivity matrix \(\textbf {J}^{\rm T}\textbf {Q}\textbf {J}+\lambda _t\textbf {I}\) is singular; in that case the determinant is zero and there is no unique solution. Whenever this occurs, it means that the current rank of the sensitivity matrix is less than the actual dimensions of the sensitivity matrix, in other words, there is a redundancy in the data. As a consequent it is impossible to generate a unique solution as there is a true linear dependency in the data. Generally, this is valid for very small eigenvalues as well and they might blow up small errors in the observations and cause large error in (some of) the estimated parameters.
They great value of the eigenvalue decomposition is that it reduces the dimensions of the inverse problem, given the model and the data to dimensions that are identifiable. Directions in the parameter space that add a very limited amount of significance are quantified by low values for the associated eigenvalues. Any parameter that is outside the numerical space of the eigenvectors (V), so called null-spaces, are ignored. Any parameter can only be adjusted inside the numerical space as defined by the eigenvectors and avoid them to move into the null-space. \(\Lambda ^{\rm e}\). This is done by projecting the Levenberg-Marquardt into the limited parameter space as defined by the eigenvectors (reduced space). Here for, the number of eigenvectors are selected until a specified amount of relative important is reached.
\begin{align*} \textbf {G}_t &= \{\textbf {v}_1,\textbf {v}_2,\cdots ,\textbf {v}_n \} \end{align*}
This span is orthonormal, which means that all patterns are perpendicular to each other and have a unit length (\(\textbf {I} =\textbf {G}^{\rm T}\textbf {G} \)). It can be interpreted as a description or span of a high-dimensional coordinate system that is defined by a limited amount of axes. It need to be recomputed for each cycle \(t\) in the optimization process as the parameters might describe different regions in parameter space. The update if computed in reduced space as:
\begin{align*} \textbf {u}_t = - \left \{\textbf {G}^{\rm T} \left [\textbf {J}^{\rm T}\textbf {Q}_h\textbf {J} + \lambda \textbf {I}\right ]\textbf {G}\right \}_t^{-1} \textbf {G}^{\rm T} \left \{\textbf {J}_t^{\rm T}\textbf {Q}_h \left [\textbf {y}-\phi \left (\textbf {p}_t\right )\right ]\right \}, \end{align*}
where u is de parameter adjustment vector in reduced space. The inverse of the projected sensitivity matrix need to be solved in reduced space, which is more efficient. To obtain the parameter adjustment in parameter space, the vector u need to be project back with:
\begin{align*} \textbf {p}_{t+1} =\textbf {p}_t + \left [\textbf {Gu}\right ]_t \end{align*}
This parameter update is activated with the option PE_SCALING=3. It is, for above mentioned reasons, advised to apply the eigenvalue decomposition to enforce a numerically more sound parameter update.
Regularisation is a general term for condition to restrict more and increase uniqueness of the minimization problem. The general formulation of the objective function restricts parameters only by their minimal and maximal sustained values. As long as the parameter is not limited by those boundaries, the solution is only constraint by the sum of squared residuals. Uniqueness is in that case less profound. The modified standard regularized Levenberg-Marquardt objective function is:
\begin{align*} \Phi \left (\textbf {p}\right )=\left [\textbf {y}-\phi \left (\textbf {p}\right )\right ]^{\rm T}\textbf {Q}_h\left [\textbf {y}-\phi \left (\textbf {p}\right )\right ]+ \left [\textbf {p}-\textbf {p}_0\right ]^{\rm T} \kappa \textbf {Q}_p\left [\textbf {p}-\textbf {p}_0\right ] \end{align*}
where \(\kappa \) is a scaling factor to balance the \(\textbf {Q}_h\) and \(\textbf {Q}_p\) as \(\kappa = \frac {N_h}{N_p}\). As the ration between the number of observations raise the number of parameters, significantly, the weight of the parameters is increased to balance more evenly. Furthermore, \(\textbf {p}_0\) is the prior estimated parameter value and \(\textbf {Q}_p\) is the prior parameter weight matrix and defined as:
\begin{align*} \textbf {Q}_p[i,i]=\frac {1}{\vartheta _i^2}, \end{align*}
where \(\vartheta _i\) is the standard deviation of the parameter \(i\). As the parameter optimisation is carried out in log\(^{10}\)-space, the standard deviation is estimated by:
\begin{align*} \vartheta = \frac {1}{4}\left [{\rm log}^{10}({p_{\rm max}})-{\rm log}^{10}(p_{\rm min})\right ]. \end{align*}
Whenever the bandwidth of a parameter adjustment is in between 0.01 and 100.0, it yields a \(\vartheta =1.0\) and a \(\textbf {Q}[i,i]=1.0\). Lowering the bandwidth to 0.1 and 10.0 increases to be \(\vartheta =\frac {1}{4}\) and \(\textbf {Q}[i,i]=4.0\). The reduced parameter update becomes eventually:
\begin{align*} \textbf {u}_t = - \left \{\textbf {G}^{\rm T} \left [\textbf {J}^{\rm T}\textbf {Q}_h\textbf {J} + \lambda \textbf {Q}_p\right ]\textbf {G}\right \}_t^{-1} \textbf {G}^{\rm T} \left \{\textbf {J}_t^{\rm T}\textbf {Q}_h \left [\textbf {y}-\phi \left (\textbf {p}_t\right )\right ]+\kappa \textbf {Q}_p\left [\textbf {p}_t-\textbf {p}_0\right ]\right \}. \end{align*}
Marquardt replaced the I within the Levenberg formulation by the prior parameter weight matrix \(\textbf {Q}_p\). In this way, as \(\lambda \) grows, and the Gauss-Newton update step is modified to be more-and-more aligned with the steepest-descent step - and most important - the direction of the gradient is dominated by parameters with the lowest variance in uncertainty (most certain). The off-diagonals of the prior parameter weight matrix \(\textbf {Q}_p\) are all zero as these are representing the relationship from one parameter towards another. These are often unknown, except in combination with Pilot Points (see section 12.33.5) where the off-diagonals of the covariance matrix is constructed from the Kriging semivariogram.
The automatic parameter estimation generates valuable statistics and error measures.
How the uncertainty of the parameters propagates into the head uncertainty is given by the posterior parameter covariance matrix \(\textbf {C}^{\rm p}\). This is computed as the total objective function value \(\Phi \) divided by the degrees of freedom, so:
\begin{align*} \textbf {C}^{\rm p}&=\frac {\Phi }{N_h-N_p}\left (\textbf {J}^{\rm T}\textbf {Q}\textbf {J}\right )^{-1} \end{align*}
the entries of the diagonal elements of this parameter covariance matrix are the squared standard deviations \(\sigma \) of the uncertainties of the parameters \(i\):
\begin{align*} \sigma ^{\rm p}[i]=\sqrt {C^{\rm p}[i,i]}. \end{align*}
where \(\sigma ^{\rm p}_i\) is the so-called standard error of the parameters. As for normal distributions, the true parameter value \(p_i\) might be for 96% certainty in between:
\begin{align*} p[i]-1.96 \times \sigma ^{\rm p}[i]\leq p[i] \geq p[i]+ 1.96 \times \sigma ^{\rm p}[i]. \end{align*}
This standard parameter error \(\sigma ^{\rm p}\) is a measure of how unexplained variability in the data propagates to variability in the parameters, and is essentially an error measure for the parameters. The variance indicates the range over which a parameter value could extend without affecting model fit too adversely.
Moreover, from this parameter covariance matrix the correlation coefficients can be computed as:
\begin{align*} R^{\rm p}[i,j]=\frac {C^{\rm p}[i,j]}{\sqrt {C^{\rm p}[i,i]\cdot C^{\rm p}[j,j]}}. \end{align*}
The correlation matrix shows those parameters that are highly correlated whenever they have correlation factors of \(>0.90\) or \(<-0.90\). This means that whenever it appears that parameter A would be larger in reality, this also will be the case for parameter B. In other words, one can be linearly predicted from the others with a non-trivial degree of accuracy.
The possibility that a parameter estimation problem runs smoothly decreases with the number of parameters. In highly parameterized problems some parameters are likely to be more sensitive in comparison with others. As a result, the Levenberg-Marquardt algorithm may decide that large adjustments are required for their values if they are to make any significant contribution to reducing the objective function F(p). However, limits are set on parameter changes, such that the magnitude (but not the direction) of the parameter update vector is reduced. If a parameter is particularly insensitive compared to others, it may denominate the parameter update vector, yielding a large update vector. This need to be trimmed to fit the limits of the parameter update and as a result the update for other parameters might not change much at all, with a result that the objective function might be not reduced significantly at all. The result is that the convergence takes place intolerable slowly (or not at all), with a huge wastage of model runs. The relative sensitivity for a parameter is computed by:
\begin{align*} s_[i] =\frac {m^{-1} \sum \limits _{j=1}^{m} w_[j] j_[ij]}{\sum s_[i] } \cdot 100\% , \end{align*}
where s\([i]\) is the sensitivity of the \(i^{\rm th}\) parameter and is the product of the observational weigh times the Jacobian value for that particular observation \(j\) in relation to the parameter I, divided by the total observation \(m\). In the figure below an example is given of the relative sensitivity of different parameter during parameter estimation.
The most sensitive parameter is the storage (S) in figure 12.20, however, the parameter adjustment is adjusting the storage the least since the least sensitive parameter, the RIVER, determines the final parameter update vector. As can see in the decrease of the objective function, it is not the best thing to do. Whenever the sensitivity of the Rivers increase, the storage becomes more important in the gradient and the objection function declines more significantly.
In order to avoid the disturbance, and therefore slower convergence, due to insensitive parameters, iMODFLOW temporary holds those parameter(s). Whether a parameter is insensitive or not is determined by the ratio of their s\({}_{i}\) value compared to the total sensitive value, see Figure 4.8.
Nash–Sutcliffe efficiency (NSE) can be used to quantitatively describe the accuracy of model outputs. It is defined as:
\begin{align*} \rm {NSE} = 1.0 - \frac {\left ( \phi (\textbf {p}) - \overline {\textbf {y}} \right )^2 } { \left (\textbf {y} - \overline {\textbf {y}}\right )^2} \end{align*}
where \(\overline {\textbf {y}}\) is the average observation value. Nash–Sutcliffe efficiency can range from \(-\infty \) to 1.0. An efficiency of 1.0 corresponds to a perfect match of modelled values to the observed values. An efficiency of 0.0 indicates that the model predictions are as accurate as the mean of the observed data, whereas an efficiency \(<\) 0 occurs when the observed mean is a better predictor than the model or. Or in other words, when the residual variance (described by the numerator in the expression above), is larger than the data variance (described by the denominator). Essentially, the closer the model efficiency is to 1.0, the more accurate the model is. Threshold values to indicate a model of sufficient quality have been suggested between 0.5 < NSE < 0.65.
The goodness of fit of a statistical model describes how well it fits a set of observations, it is defined as:
\begin{align*} \rm {GOF} =& \sqrt { \frac { \left ( \phi (\textbf {p}) - \overline {\phi }(\textbf {p}) \right ) \times \left ( \textbf {y} - \overline {\textbf {y}} \right ) } { \left ( \phi (\textbf {p}) - \overline {\phi }(\textbf {p}) \right )^2 \times \left ( \textbf {y} - \overline {\textbf {y}} \right )^2 }}; \\ =& \sqrt {\frac {\sigma [\phi (\textbf {p})] \times \sigma [\textbf {y}]}{\sigma [\phi (\textbf {p})]^2 \times \sigma [\textbf {y}]^2} }. \end{align*}
Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g. to test for normality of residuals, to test whether two samples are drawn from identical distributions, or whether outcome frequencies follow a specified distribution. In fact, it is a normalized covariance that tends to be 1.0 if the spread of the distributions are equally and \(<\) 1.0 in case both distributions differ.
search infinite exponential variogram (range 2/3 pilot distance) limit points 10 (use number of point = 1 , nearest neighbour) lambda 0.01-100.0 honour pilotpoint location on gridmid
Kriging is a stochastic interpolation method used to estimate the value of an attribute at an unknown location. The beauty of Kriging is that coincides that the unknown values, that need to be interpolated, meet the statistics of the measurement points. Hence the predicted location obtains the lowest statistical probability of not yielding the desired result. Kriging uses prior knowledge about the spatial distribution of an event such as permeability’s: this prior knowledge encapsulates how permeability varies over a given region. Then, given a series of measurements of permeability’s, Kriging can predict permeability’s at unobserved points \(x^*\). The predicted value \(\hat {z}\) at location \(x^*\) is defined by:
\begin{align*} \hat {z}(x^*)= \begin {bmatrix} w_1, & w_2, &\cdots & w_n \\ \end {bmatrix} \cdot \begin {bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \\ \end {bmatrix} \end{align*}
where \(w_i\) are the weighting variables for each of the measurements \(z_i\). It is different than Inverse Distance Weighting (IDW) which also ignores the spatial location of the sample points and assign weight variables based on the distances between measuring points and unknown points. In Kriging, however, statistics is used to assign values to the weighting variable, such that all locations are related but nearby locations are more related than more distant locations. Moreover, for locations that are strongly correlated, those with less correlation with the estimation location, are effectively ignored. The covariances (interrelationship) and thus the Kriging weights, are determined entirely by the data. The measure that determines the interrelationship between all locations is given by the semivariance. It describes the variance of variables \(\gamma \) when the corresponding pair of points \(x_i\) and \(x_j\) are at some distance (\(d=\left |x_i-x_j\right |\)) apart, such that:
\begin{align*} \gamma \left (d\right )=\frac {1}{2N\left (d\right )}\sum _{i=1}^{N\left (d\right )}{\left (z\left (x_i\right )-z\left (x_i+d\right )\right )^2} \end{align*}
where \(z_i\) is the value of the measurement at location \(i\), \(z_j\) at location \(j\) within the distance \(d\) and \(N(d)\) represents the number of data points that belong to \(d\), normally \(d\) represents a certain tolerance of distance. The correlation between the variables solely depends on the spatial distance that separates them and is independent of its true location (i.e. stationarity of the second moment). The key here is to take as many samples within the study region as possible to insure the statistical accuracy of the semivariance at each distinct distance \(d\) and be able to calculate an average semivariance (from all sample points). Those semivariances combined, defines a semivariogram. Intersecting the semivariogram in three places provides the nugget (the height of the jump of the semivariogram at the origin), the sill (highest level of semivariance) and the range (where the line of best fit for the semivariance has slope of zero). This means that beyond the range no relationship exists between corresponding pair of points. The nugget effect can be attributed to measurement errors or spatial sources of variation at distances smaller than the sampling interval or both. Measurement error occurs because of the error inherent in measuring devices. Natural phenomena can vary spatially over a range of scales. Variation at microscales smaller than the sampling distances will appear as part of the nugget effect. Before collecting data, it is important to gain some understanding of the scales of spatial variation. It often happens that the semivariogram is a good match for spherical functions, however, linear and/or exponential functions (acts as a kind of IDW) can be applicable as well. The following expressions are used for those relationships:
\begin{align*} \gamma (d)&=c_0+c_1*\frac {d}{\rm range} &{\rm linear} \\ \\ \gamma (d)&=c_0+c_1*1.5 \frac {d}{\rm range} - 0.5 (\frac {d}{\rm range})^3 &{\rm spherical} \\ \\ \gamma (d)&=c_0+c_1*1.0-{\rm 10.0}^{ (-3.0 \frac {d}{\rm range})} &{\rm exponential} \\ \\ \gamma (d)&=c_0+c_1*1.0-{\rm 10.0}^{(-3.0 \frac {d^2}{\rm range^2})} &{\rm gaussian} \\ \\ \gamma (d)&=c_0+c_1*d^{0.5} &{\rm power} \end{align*}
where \(d\) is the true distance between two locations. With those semivariances the weights are computed to be used in estimating values of the attribute value in unknown locations. Using the semivariances predicted from the semivariogram the results from the linear equations are the weights producing an interpolation minimizing the error in the predicted values. In these, two approaches are described and implemented in iPEST.
Simple Kriging
Simple Kriging assumes stationarity of the first moment (all variables have the same mean \(\bar {z}\)) over the entire domain. Eventually the following system is solved for each location \(x^*\) to obtain the weights \(\textbf {W}\):
\begin{align*} \textbf {W}= \textbf {K}_{\rm S}^{-1}\textbf {k}_{\rm S}= \begin {bmatrix} \gamma (x_1,x_1) & \cdots & \gamma (x_1,x_n) \\ \vdots & \ddots & \vdots \\ \gamma (x_n,x_1) & \cdots & \gamma (x_n,x_n) \\ \end {bmatrix}^{-1} \begin {bmatrix} \gamma (x_1,x^*) \\ \vdots \\ \gamma (x_n,x^*) \\ \end {bmatrix} \end{align*}
The assumption that you will know the exact mean \(\bar {z}\) is often unrealistic. However, sometimes it makes sense to assume that a physically based model gives a known trend. Then you can take the difference between that model and the observations, called residuals, and use Simple Kriging on the residuals, assuming the trend in the residuals is known to be zero. This is done in Simple Kriging.
Ordinary Kriging
In Ordinary Kriging a constant unknown mean \(\bar {z}\) is assumed only over the search neighborhood of \(x^*\). So, instead of subtracting the global mean for each random variable, the mean is computed for the values in the search neighborhood of each individual location \(x^*\).
\begin{align*} \begin {bmatrix} \textbf {W} \\ \mu \end {bmatrix}= \textbf {K}_{\rm O}^{-1}\textbf {k}_{\rm O}= \begin {bmatrix} \gamma (x_1,x_1) & \cdots & \gamma (x_1,x_n) & 1.0 \\ \vdots & \ddots & \vdots & \vdots \\ \gamma (x_n,x_1) & \cdots & \gamma (x_n,x_n) & 1.0 \\ 1.0 & \cdots & 1.0 & 0.0 \\ \end {bmatrix}^{-1} \begin {bmatrix} \gamma (x_1,x^*) \\ \vdots \\ \gamma (x_n,x^*) \\ 1.0 \\ \end {bmatrix} \end{align*}
where \(\mu \) is the Lagrange multiplier used in the minimization of the Kriging error \(\sigma ^2_k(x)\) to honor the unbiasedness condition. In other words, this \(\mu \) forces the sum of the Kriging weights to become \(1.0\). Ordinary Kriging can be used for data that seems to have a trend.
Kriging Variance
By Kriging it is possible to compute the variance \(\sigma ^2\) of the estimated points as well. Therefore, the computed Kriging weights \(\textbf {w}\) need to be multiplied with the matrix k, so:
\begin{align*} \sigma ^2_{\rm O} = \text {w}^{\rm T}\textbf {k}_{\rm O}= \begin {bmatrix} w_1, & \cdots & w_n, & \mu \\ \end {bmatrix} \begin {bmatrix} \gamma (x_1,x^*) \\ \vdots \\ \gamma (x_n,x^*) \\ 1.0 \end {bmatrix} \end{align*}
The standard deviation of the estimate becomes \(\sqrt {\sigma ^2_{\rm O}}\). The variance estimate depends entirely on the data configuration and the covariance function, not on the data values themselves.
Uncertainty in groundwater may be divided into two classes; intrinsic uncertainty and information uncertainty. The first class derives from the variability of certain natural properties or processes and is an irreducible uncertainty inherent to the system. The second class is the result of ”noisy“ or incomplete information about the system and may be reduced by various strategies, notable further measurements and analyses [Dettinger and Wilson(1981)]. The spatial and temporal variation of parameters, such as the recharge rate, and spatial variability of properties such as hydraulic conductivity are extremely complicated. The first idea for the First-Order Second Moment Method (FOSM) was applied to groundwater by [Dettinger and Wilson(1981)]. The first moment of the heads is assumed to first-order accuracy by the mean heads obtained as the solution of the model using the optimized values for the model parameters. The variance of the head is denoted as the Second-Order moment. The FOSM method is a approximation method and one of the most widely applied in engineering design. One of the advantages of the method is that it allows the estimation of uncertainty in the output variables without knowing the shapes of pdfs of input variables in detail.
vergelijk parameter covariance matrix vooraf en achteraf !!!
The Jacobian expresses the sensitivity in \(\phi \) to variations or uncertainties in the parameters. In this case the Jacobian is computed for all nodes in the model, and not only for the location where measurements are available, as done by the Levenberg-Marquardt Algorithm (LMA), so in this case \(\textbf {J}_{\rm fosm}\) is a \(N_M \times N_P\) matrix as the Jacobian \(\textbf {J}\) for the LMA is \(N_H \times N_P\), where \(N_M\) is the total number of model nodes.
\begin{align*} {\rm var}(\phi )={\rm diag}(\textbf {J}_{\rm fosm}\textbf {C}^{\rm p}\textbf {J}_{\rm fosm}^{\rm T}) \end{align*}
Finally the standard head error \(\sigma ^\phi \) due to the standard parameter error \(\sigma ^{\rm p}\) becomes \(\sqrt {{\rm var}(\phi )}\). Owing to its simplicity the First-Order Second Moment Method (FOSM) is one of the most widely used technique. Its problem, though, is it linearisation of the model-output function about the mean values of the input variables assuming to represent the statistical properties of the model output over the complete range of input values.
In this section a simple example is given for a single optimalisation step using the Levenberg-Marquardt algorithm as outlines in the previous sections. Suppose two parameters \(p_1=p_2=1.0\) need to be optimized that describe two different zones for permeability in a single modellayer. Two measurements are available, one measurement in each zone. The first simulation is to run the model with \(p_1=p_2=1.0\), thereafter with \(p_1=\Delta p\) and \(p_2=1.0\) and finally with \(p_1=1.0\) and \(p_2=\Delta p\) with \(\Delta p=1.1\). For each measurement location the measurements \(z_j\) is subtracted from the computed head \(\phi ^i_j\) for the \(i^{\rm th}\) simulation and stored in the matrix D [m], thus:
\begin{align*} \textbf {D}= \begin {bmatrix} \phi ^0_1-z_1 & \phi ^1_1-z_1 & \phi ^2_1-z_1 \\ \phi ^0_2-z_2 & \phi ^1_2-z_2 & \phi ^2_2-z_2 \\ \end {bmatrix} \Rightarrow \textbf {D}= \begin {bmatrix} -1.484 & -1.391 & -1.406 \\ -1.686 & -1.645 & -1.513 \\ \end {bmatrix} \end{align*}
From D it is obvious that both parameters decline the objective function by a multiplication of \(\Delta p\). An adjustment in \(p_1\) does reduce the measurement \(z_1\) mostly and \(p_2\) reduces the second measurement \(z_2\). However, both parameters influence both measurements and therefore both parameters cannot be estimated separately as they are correlated to eachother. This is more clear in the Jacobian J [m/\(\Delta p\)], defined as:
\begin{align*} \textbf {J}= \begin {bmatrix} (d^2_1-d^1_1)/\Delta p_1 & (d^3_1-d^1_1)/\Delta p_1 \\ (d^2_2-d^1_2)/\Delta p_2 & (d^3_1-d^1_1)/\Delta p_2 \\ \end {bmatrix} \Rightarrow \textbf {J}= \begin {bmatrix} 0.976 & 0.818 \\ 0.430 & 1.815 \\ \end {bmatrix} \end{align*}
where \(d^i_j\) represent the \(i^{\rm th}\) column and \(j^{\rm th}\) row from the D matrix. As it is common to apply a \(\log \) transformation whenever permeabilities are calibrated, the value for \(\Delta p=\log (1.1)=0.0953\). Based on the Jacobian, the Hessian H [\(\Delta p^2/m^2\)] is computed as:
\begin{align*} \textbf {H}=(\textbf {J}^{\rm T}\textbf {QJ}+\psi \textbf {I})^{-1}= \begin {bmatrix} 1.137 & 1.585 \\ 1.585 & 3.977 \\ \end {bmatrix}^{-1} = \begin {bmatrix} 1.979 & -0.789 \\ -0.789 & 0.566 \\ \end {bmatrix} \end{align*}
where Q is an identity matrix with diagonal values of \(1.0\), so every measurement has equal weights. The Hessian is proportional to the curvature of the plane of the objective function \(\Phi \), it implies a large step in the direction with low curvature (i.e., an almost flat terrain) and a small step in the direction with high curvature (i.e, a steep incline). From the Hessian the parameter covariance \(\textbf {C}^{\rm p}\) [\(\Delta p^2\)] is estimated as:
\begin{align*} \textbf {C}^{\rm p}=\frac {\Phi _m}{{\rm max}(1.0,N_h-N_p)}\textbf {H}= \frac {4.266}{1} \begin {bmatrix} 1.979 & -0.789 \\ -0.789 & 0.566 \\ \end {bmatrix} = \begin {bmatrix} 8.442 & -3.365 \\ -3.365 & 2.414 \\ \end {bmatrix} \end{align*}
where the total objective function \(\Phi =4.266\) m\(^2\) and the number of measurements \(N_h=2\) and the number of parameter \(N_p=2\). From these parameter covariance matrix the parameter standard error \(\textbf {e}^{\rm p}\) [\(\Delta p\)] can be computed as:
\begin{align*} \textbf {e}^{\rm p}=\sqrt {{\rm diag}(\textbf {C}^{\rm p})}= \begin {bmatrix} 2.905 \\ 1.554 \\ \end {bmatrix} \end{align*}
In other words, the twice the parameter standard error represents \(95\%\) of the confident limits of the parameter. In this case the current parameter value is highly uncertain as it \(\Delta p_1 \pm 2.905\) and \(\Delta p_2 \pm 1.554\). Since the parameter is log transformed, the confident limits of the parameters are currently:
\begin{align*} {\rm exp}(\Delta p_1=0.0 \pm 2.905) & \longrightarrow 0.0033 > \Delta p_1 < 297.32 \\ {\rm exp}(\Delta p_2=0.0 \pm 1.554) & \longrightarrow 0.0476 > \Delta p_2 < 21.02 \\ \end{align*}
As the optimization of the parameters continues, this parameter standard error should decline as the objective function value \(\Phi \) declines and the Jacobian will be become more flattened in the minimum of the plane of \(\Phi \). As a result the confident limits decreases as well. Another useful parameter is the correlation matrix \(\textbf {R}^{\rm p}\) that can be computed from the covariance matrix \(\textbf {C}^{\rm p}\) easily.
\begin{align*} \textbf {R}^{\rm p}= \begin {bmatrix} 1.0 & -0.745 \\ -0.745 & 1.0 \\ \end {bmatrix} \end{align*}
From this correlation matrix is can be observed that both parameters are not correlated significantly and can be determined in combination. Another method of determining the uniqueness of the covariance matrix is the computation of its eigenvalues and eigenvectors, so:
\begin{align*} (\textbf {J}^{\rm T}\textbf {QJ})\textbf {v}=\Lambda \textbf {v}= \begin {bmatrix} 1.137 & 1.585 \\ 1.585 & 3.977 \\ \end {bmatrix} \begin {bmatrix} 0.408 & 0.913 \\ 0.913 & -0.408 \\ \end {bmatrix} = \begin {bmatrix} 91.611 \\ 8.389 \\ \end {bmatrix} \begin {bmatrix} 1.137 & 1.585 \\ 1.585 & 3.977 \\ \end {bmatrix} \end{align*}
where \(\Lambda \) contains the ordered eigenvalues (singular values are \(\Lambda ^\frac {1}{2}\)) and \(\textbf {v}\) are the eigenvectors. The eigenvectors are representing the axis of the residual surface, the eigenvalues \(\sqrt {\lambda _i}\) represent the length of axis \(i\). The ratio between the first and last eigenvalues is the condition number \(\kappa \), thus \(\kappa =91.611/8.389=10.920\). This condition number \(\kappa \) is relatively small and therefore the problem is well-conditioned; the parameters are identifiable, they all contribute significantly and unique to the residual surface.
The Gradient Descent gradient \(\Delta \textbf {p}^*\) of the parameter adjustment becomes:
\begin{align*} \Delta \textbf {p}^* =-\textbf {J}^{\rm T}\textbf {Qd}_1= -\begin {bmatrix} 0.976 & 0.818 \\ 0.430 & 1.815 \\ \end {bmatrix} \begin {bmatrix} 1.0 & 1.0 \\ 0.0 & 0.0 \\ \end {bmatrix} \begin {bmatrix} -1.484 \\ -1.686 \\ \end {bmatrix} = \begin {bmatrix} 2.177 \\ 4.280 \\ \end {bmatrix} \end{align*}
The parameter update becomes the \({\rm exp}()\) of the computed Gradient Descent gradient, so \(\Delta p_1=8.82\) and \(\Delta p_2=72.24\), which is rather huge that can be reduced within the limits of a trust hypersphere. Whenever the Hessian H is included (at this stage the \(\psi \) is assumed to be 0.0 meaning that the full Gauss-Newton algorithm is applied, this \(\psi \) normally increases during the optimisation such that in the final stage the parameter update will be more equal to the Gradient Descent method) the final parameter update vector \(\Delta \textbf {p}\) becomes:
\begin{align*} \Delta \textbf {p} = -\textbf {HJ}^{\rm T}\textbf {Qd}_1 = -\begin {bmatrix} 1.979 & -0.789 \\ -0.789 & -0.566 \\ \end {bmatrix} \begin {bmatrix} -2.177 \\ -4.280 \\ \end {bmatrix} = \begin {bmatrix} 0.931 \\ 0.705 \\ \end {bmatrix} \end{align*}
The elements for the final parameter update \(\Delta \textbf {p}\) becomes \(\Delta p_1={\rm exp}(0.0+0.931)=2.537\) and \(\Delta p_2={\rm exp}(0.0+0.705)=2.024\). After this the sequence starts again.
Notes:
• Do not use parameter adjustments that are too large, whenever many parameters are concerned use a step size of 2, whenever you have less parameters you can increase this to a maximum of 10;
• Experiment with different starting values for a parameter to see whether you end up with the same optimal values;