# iMOD User Manual version 4.4 (html)

#### 12.33PST Parameter estimation

In mathematics and computing, the Levenberg-Marquardt algorithm (LMA) provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function. These minimization problems arise especially in least squares curve fitting and nonlinear programming. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach. The LMA is a very popular curve-fitting algorithm used in many software applications for solving generic curve-fitting problems. However, the LMA finds only a local minimum, not a global minimum.

The used algorithm is known as the Levenberg-Marquardt algorithm (LMA) that goes back to 1943 as Kenneth Levenberg presented his work to the Mathematical Society at their annual meeting (Levenberg, K. The Quarterly of Applied Mathematics 2, 164 (1944)). Marquardt on the other hand popularized the method by distributing his FORTRAN code for free. In the following section a brief overview is given of the Levenberg-Marquardt algorithm and its implementation in iMODFLOW.

Most of the following LMA implementation has been based on the paper of Olsthoorn (1995), Effective Parameter Optimization for Ground-Water Model Calibration, Groundwater, Vol. 33, no.1) and the paper of Knorr BM (2011), The Levenberg-Marquardt Algorithm, Seminar in Modern Physics, Summer 2011 and the PEST Manual of (Doherty2010).

##### 12.33.1Methodology

The core of parameter estimation is the minimization of some error criterion, cost or objective function $\Phi _{\rm m}(\textbf {p})$, that depends on a parameter vector p with elements $p_i \rightarrow i=1,N_p$ where $N_p$ denotes the number of unknowns to be optimized (i.e. the amount of parameters). In general the objective function $\Phi _{\rm m}(\textbf {p})$ is the sum of squares sum of the individual errors notated as:

\begin{align*} \Phi _{\rm m}(\textbf {p})=\left (\textbf {y}-\phi (\textbf {p})\right )^{\rm T}\textbf {Q}_1\left (\textbf {y}-\phi (\textbf {p})\right ) \end{align*}

where y are the measurements with elements $y_i \rightarrow i=1,N_h$; where $N_h$ denotes the number of observations; $\phi (\textbf {p})$ are the computed head for the parameters defined in p and Q is the weight matrix assigned to the observations defined as:

\begin{align*} Q_{i,i}=\frac {1}{\sigma ^2_i} \end{align*}

where $Q_{i,i}$ is the weight for the $i^{\rm th}$ observation. The variance $\sigma ^2_i$ is the squared standard deviation $\sigma _i$ that measures the amount of variation from the average. A low $\sigma _i$ indicates that an particular observation $y_i$ should be able to the meet the corresponding computed head $\phi _i$ more closely that observations with higher values of variations $\sigma$. It is possible to specify weight values $Q_{i,i}$ or variances $\sigma ^2_i$ in iPEST.

The Levenberg-Marquardt algorithm is applied to minimize the objective function value by adjusting the individual values for the parameter vector with $\alpha _i\textbf {p}_i$ where $\alpha _i$ is the optimal multiplication factor for the $i^{\rm th}$ parameter that yields a minimal objective function value. In order to arrive at a valid minimal objective function value, it is advisable to let $N_h$ substantially larger than the number of parameters $N_p$ (Yeh and Yoon (1981), Aquifer parameter identification with optimum dimension in parameterization, Wat.Res.Res, v17, no. 3, pp. 664-672) or apply regularisation as done by the Pilot Point concept (Doherty, 2003), see subsection Section 12.33.3.

The Gradient Descent Method (the simplest method) approaches the minimum of the objective function $\Phi _m(\textbf {p})$ by adjusting each parameter according to:

\begin{align*} \textbf {p}_{i+1} =\textbf {p}_{i} -\zeta _i \nabla \Phi _{\rm m}(\textbf {p}_{i}~), \end{align*}

where the subscript i denotes the sequential parameter iteration (parameter update cycle) and $\zeta _i$ is a weighting factor for the $i^{\rm th}$ cycle. It says that for each cycle $i$, the individual parameter sensitivities ($\nabla \Phi _{\rm m}(\textbf {p}))$ to the residual surface (i.e. the multi dimensional representation of the objective function), multiplied with a weighting factor $\zeta$ will contribute to a reduction on the objective function. In this way steps are taken towards the minimum according to the gradient $\nabla$ of the residual surface. The problem is that the Gradient Descent Method takes large steps in those areas of the residual surface that have small gradients and it takes small steps for those areas with large gradients. It normally leads to zigzagging in long narrow valleys on the $\Phi _{\rm m}(\textbf {p})$ residual surface. The Gauss-Newton method replaces the scaling factor $\zeta$ by the inverse of the curvature (second derivative$\nabla ^{2}\Phi _{\rm m}(\textbf {p})$ often called the Hessian) of the $\Phi _{\rm m}(\textbf {p})$ surface and interchanges the undesired behaviour of the Gradient Descent Method. It therefore converges faster, so:

\begin{align*} \textbf {p}_{i+1} =\textbf {p}_{i} -\left (\nabla ^{2} \Phi _{\rm m}(\textbf {p}_{i}~)\right )^{-1} \nabla \Phi _{\rm m}(\textbf {p}_{i}~). \end{align*}

The gradient $\nabla \Phi _{\rm m}(\textbf {p})$ is denoted as the Jacobian J and each column in that matrix J is defined by:

\begin{align*} \textbf {J} =\frac {\partial \phi }{\partial \textbf {p}_i } =\frac {\phi (\textbf {\textbf {p}}_i )-\phi (\textbf {p}_0 )}{\Delta \alpha _i } , \end{align*}

where $\textbf {J}$ is a Jacobian matrix $N_h \times N_p$ (number of observations row wise and number of parameters column wise) and represents the sensitivity of the residual for each observation point according to a small perturbation $\Delta \alpha _i$ in the $i^{\rm th}$ parameter compared to the original parameter value p${}_{0}$. Since the algorithm assumes linearity over the interval $\Delta \alpha _i$, the second derivative $\nabla ^2\Phi _{\rm m}(\textbf {p})$ is approximated by $\textbf {J}^{\rm T}\textbf {QJ}$ in the neighbourhood of $\textbf {p}_0$. This yields the following Gauss-Newton parameter update formula:

\begin{align*} \textbf {p}_{i+1} &=\textbf {p}_i+\Delta \textbf {p}_i \\ \\ \Delta \textbf {p}_i&=-2\left (\textbf {J}^{\rm T} \textbf {Q} \textbf {J}\right )^{-1} \textbf {J}^{\rm T} \textbf {Q} \left (\textbf {y}-\phi (\textbf {p})\right ) \end{align*}

where Q is the diagonal weight matrix with individual weight values $q_i$ in the diagonal, where the product $-2\textbf {J}^{\rm T}\textbf {Q}\left (\textbf {y}-\phi (\textbf {p})\right )$ represents the steepest gradient on the residual surface. If parameters are far from their optimum, which they are probably initially, this $\textbf {J}^{\rm T}\textbf {QJ}$ is only a crude approximation of the true Hessian matrix. As a result the parameter update vector $\Delta \textbf {p}$ might be quite wrong which can result in a failure to converge. The great insight of Levenberg was to simply combine the Gradient Descent and the Gauss-Newton Methods to include a damping factor $\lambda$ which determines how much of the Gradient Descent or Gauss-Newton Method to include, so:

\begin{align*} \Delta \textbf {p}=-2\left (\textbf {J}^{\rm T} \textbf {Q} \textbf {J}-\psi \textbf {I}\right )^{-1} \textbf {J}^{\rm T} \textbf {Q} \left (\textbf {y}-\phi (\textbf {p})\right ) \end{align*}

where I is the identity matrix. Whenever $\psi$ is large, the parameter update will be determined more by the Gradient Descent Methods and whenever $\psi$ is small, the Gauss-Newton Method will be included significantly. The great insight of Marquardt is to replace the identity matrix I by the diagonal of the matrix $\textbf {J}^{\rm T} \textbf {Q} \textbf {J}$, so:

\begin{align*} \Delta \textbf {p}=-2\left (\textbf {J}^{\rm T} \textbf {Q} \textbf {J}-\psi {\rm diag}(\textbf {J}^{\rm T} \textbf {Q} \textbf {J})\right )^{-1} \textbf {J}^{\rm T} \textbf {Q} \left (\textbf {y}-\phi (\textbf {p})\right ) \end{align*}

Olsthoorn (1994) suggested to adjust $\psi$ such that the yielding parameter update vector $\Delta$p is within a certain trust hyper sphere (with a radius around the current parameter vector which is determines by their minimal en maximal values and a maximal $\Delta$p compared to the previous iteration). So, by starting at a small value for $\psi$ (full confidence in the contribution of Gauss-Newton), it will increase until all parameter vectors are within their trust hyper sphere. It should be noticed that parameters that exceed their upper and or lower bounds, within their trust hyper sphere, will influence the parameter update for the other parameters and coming iterations. This is circumvented by temporarily remove the Jacobian vector $\textbf {J}_i$ for that particular parameter $i$ and hold the parameter at their upper- or lower bounds (i.e. a frozen parameter). On later iterations, the parameter will be included whenever the parameters vector will be calculated that moves parameters from their bounds back into the allowed parameter domain.

In the figure below it is demonstrated how the trusted hyper sphere affects the behaviour of convergence to the minimum of the $\Phi _{\rm m}(\textbf {p})$ surface. In practice, small trust hyper spheres will yield more iterations than large trust hyper spheres; however, the latter can zigzag from one side of the valley onto the other side. In table the following behavior has to be expected for the different settings. Figure 12.19: Example of the different behaviours in a common $\Phi _{\rm m}(\textbf {p})$ surface for different trust hyper spheres, purple=1000, green=100, red=10 and blue=2. Solid lines are Levenberg and dashed lines are Marquardt.
##### 12.33.2Eigenvalue Decomposition

Identifiability of the parameters to be optimized is a prerequisite of model calibration. This means that a unique set of parameters values $\textbf {p}$ yields a unique head $\phi$ field. In fact, all information for identifiability of the parameters is contained in the Jacobian matrix of the model at the optimum values of the parameters.

\begin{align*} \textbf {J}=\left [\frac {\partial \phi }{\partial p_i},\frac {\partial \phi }{\partial p_{i+1}},...,\frac {\partial \phi }{\partial p_{n_p}}\right ] \end{align*}

The Jacobian reveal the observability of the parameters by virtue of its rows and columns. Each row expresses the sensitivity of a single observation with respect the set of parameters. Each column expresses the sensitivity of all observations with respect to a single parameter. Some fundamental studies have been carried out which shed light of the inverse problem with respect to the identifiability parameters (Dietrich, 1990, Speed and Ahlfeld, 1996). These are based on the singular value decomposition of the sensitivity matrix, they even define the dimensions of the inverse problem, given the model and the data. Their conclusions should be valid if they are based on the optimum parameters and if the model is no too non-linear near its optimum.

\begin{align*} \left (\textbf {J}^{\rm T}\textbf {Q}\textbf {J}\right )\textbf {v}=\Lambda \textbf {v} \end{align*}

where $\Lambda$ contains the ordered eigenvalues (singular values are $\Lambda ^\frac {1}{2}$) and $\textbf {v}$ are the eigenvectors. The eigenvectors are representing the axis of the residual surface, the eigenvalues $\sqrt {\lambda _i}$ represent the length of axis $i$. The ratio between the first and last eigenvalues is the condition number $\kappa =\lambda _1/\lambda _n$. As a general rule of thumb, if the condition number $\kappa (A)=10^k\Longrightarrow k=\log (\kappa )$, then you may lose up to $k$ digits of accuracy on top of what would be lost to the numerical method due to loss of precision from arithmetic methods. However, the condition number does not give the exact value of the maximum inaccuracy that may occur in the algorithm. It generally just bounds it with an estimate (whose computed value depends on the choice of the norm to measure the inaccuracy). An informal rule of thumb is that if the condition number $\kappa =15$, multicollinearity is a concern (multicollinearity or collinearity is a statistical phenomenon in which two or more variables in a multiple regression model are highly correlated); if it is greater than $\kappa > 30$ multicollinearity is a very serious concern. But again, these are just informal rules of thumb. The size of the condition number determines the shape of the minimum in the residual surface. Whenever the condition number is small, the parameters are identifiable, they all contribute significantly and unique to the residual surface. Whenever the eigenvalues decomposition leads to eigenvalues of zero, it means that the matrix $\textbf {J}^{\rm T}\textbf {Q}\textbf {J}$ is singular; in that case the determinant is zero. Whenever this occurs, it means that the current rank of the Jacobian matrix is less than the actual dimensions of the Jacobian matrix, in other words, there is a redundancy in the data, as a consequent it is impossible to generate a unique solution. There is a true linear dependency in the data. More often, the smallest eigenvalue might be very small that blow up small errors in the observations and cause large error in (some of) the estimated parameters. It is important to compute those eigenvalues from the covariance matrix stored in double precision.

##### 12.33.3Pilot Points and Regularisation

###### 12.33.3.1Kriging

Kriging is a stochastic interpolation method used to estimate the value of an attribute at an unknown location. The beauty of Kriging is that coincides that the unknown values, that need to be interpolated, meet the statistics of the measurement points. Hence the predicted location obtains the lowest statistical probability of not yielding the desired result. Kriging uses prior knowledge about the spatial distribution of an event such as permeability’s: this prior knowledge encapsulates how permeability varies over a given region. Then, given a series of measurements of permeability’s, Kriging can predict permeability’s at unobserved points $x^*$. The predicted value $\hat {z}$ at location $x^*$ is defined by:

\begin{align*} \hat {z}(x^*)= \begin {bmatrix} w_1, & w_2, &\cdots & w_n \\ \end {bmatrix} \cdot \begin {bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \\ \end {bmatrix} \end{align*}

where $w_i$ are the weighting variables for each of the measurements $z_i$. It is different than Inverse Distance Weighting (IDW) which also ignores the spatial location of the sample points and assign weight variables based on the distances between measuring points and unknown points. In Kriging, however, statistics is used to assign values to the weighting variable, such that all locations are related but nearby locations are more related than more distant locations. Moreover, for locations that are strongly correlated, those with less correlation with the estimation location, are effectively ignored. The covariances (interrelationship) and thus the Kriging weights, are determined entirely by the data. The measure that determines the interrelationship between all locations is given by the semivariance. It describes the variance of variables $\gamma$ when the corresponding pair of points $x_i$ and $x_j$ are at some distance ($d=\left |x_i-x_j\right |$) apart, such that:

\begin{align*} \gamma \left (d\right )=\frac {1}{2N\left (d\right )}\sum _{i=1}^{N\left (d\right )}{\left (z\left (x_i\right )-z\left (x_i+d\right )\right )^2} \end{align*}

where $z_i$ is the value of the measurement at location $i$, $z_j$ at location $j$ within the distance $d$ and $N(d)$ represents the number of data points that belong to $d$, normally $d$ represents a certain tolerance of distance. The correlation between the variables solely depends on the spatial distance that separates them and is independent of its true location (i.e. stationarity of the second moment). The key here is to take as many samples within the study region as possible to insure the statistical accuracy of the semivariance at each distinct distance $d$ and be able to calculate an average semivariance (from all sample points). Those semivariances combined, defines a semivariogram. Intersecting the semivariogram in three places provides the nugget (the height of the jump of the semivariogram at the origin), the sill (highest level of semivariance) and the range (where the line of best fit for the semivariance has slope of zero). This means that beyond the range no relationship exists between corresponding pair of points. The nugget effect can be attributed to measurement errors or spatial sources of variation at distances smaller than the sampling interval or both. Measurement error occurs because of the error inherent in measuring devices. Natural phenomena can vary spatially over a range of scales. Variation at microscales smaller than the sampling distances will appear as part of the nugget effect. Before collecting data, it is important to gain some understanding of the scales of spatial variation. It often happens that the semivariogram is a good match for spherical functions, however, linear and/or exponential functions (acts as a kind of IDW) can be applicable as well. The following expressions are used for those relationships:

\begin{align*} \gamma (d)&=c_0+c_1*\frac {d}{\rm range} &{\rm linear} \\ \\ \gamma (d)&=c_0+c_1*1.5 \frac {d}{\rm range} - 0.5 (\frac {d}{\rm range})^3 &{\rm spherical} \\ \\ \gamma (d)&=c_0+c_1*1.0-{\rm exp}(-3.0 \frac {d}{\rm range}) &{\rm exponential} \\ \\ \gamma (d)&=c_0+c_1*1.0-{\rm exp}(-3.0 \frac {d^2}{\rm range^2}) &{\rm gaussian} \\ \\ \gamma (d)&=c_0+c_1*d^{0.5} &{\rm power} \end{align*}

where $d$ is the true distance between two locations. With those semivariances the weights are computed to be used in estimating values of the attribute value in unknown locations. Using the semivariances predicted from the semivariogram the results from the linear equations are the weights producing an interpolation minimizing the error in the predicted values. In these, two approaches are described and implemented in iPEST.

Simple Kriging

Simple Kriging assumes stationarity of the first moment (all variables have the same mean $\bar {z}$) over the entire domain. Eventually the following system is solved for each location $x^*$ to obtain the weights $\textbf {W}$:

\begin{align*} \textbf {W}= \textbf {K}_{\rm S}^{-1}\textbf {k}_{\rm S}= \begin {bmatrix} \gamma (x_1,x_1) & \cdots & \gamma (x_1,x_n) \\ \vdots & \ddots & \vdots \\ \gamma (x_n,x_1) & \cdots & \gamma (x_n,x_n) \\ \end {bmatrix}^{-1} \begin {bmatrix} \gamma (x_1,x^*) \\ \vdots \\ \gamma (x_n,x^*) \\ \end {bmatrix} \end{align*}

The assumption that you will know the exact mean $\bar {z}$ is often unrealistic. However, sometimes it makes sense to assume that a physically based model gives a known trend. Then you can take the difference between that model and the observations, called residuals, and use Simple Kriging on the residuals, assuming the trend in the residuals is known to be zero. This is done in Simple Kriging.

Ordinary Kriging

In Ordinary Kriging a constant unknown mean $\bar {z}$ is assumed only over the search neighborhood of $x^*$. So, instead of subtracting the global mean for each random variable, the mean is computed for the values in the search neighborhood of each individual location $x^*$.

\begin{align*} \begin {bmatrix} \textbf {W} \\ \mu \end {bmatrix}= \textbf {K}_{\rm O}^{-1}\textbf {k}_{\rm O}= \begin {bmatrix} \gamma (x_1,x_1) & \cdots & \gamma (x_1,x_n) & 1.0 \\ \vdots & \ddots & \vdots & \vdots \\ \gamma (x_n,x_1) & \cdots & \gamma (x_n,x_n) & 1.0 \\ 1.0 & \cdots & 1.0 & 0.0 \\ \end {bmatrix}^{-1} \begin {bmatrix} \gamma (x_1,x^*) \\ \vdots \\ \gamma (x_n,x^*) \\ 1.0 \\ \end {bmatrix} \end{align*}

where $\mu$ is the Lagrange multiplier used in the minimization of the Kriging error $\sigma ^2_k(x)$ to honor the unbiasedness condition. In other words, this $\mu$ forces the sum of the Kriging weights to become $1.0$. Ordinary Kriging can be used for data that seems to have a trend.

Kriging Variance

By Kriging it is possible to compute the variance $\sigma ^2$ of the estimated points as well. Therefore, the computed Kriging weights $\textbf {w}$ need to be multiplied with the matrix k, so:

\begin{align*} \sigma ^2_{\rm O} = \text {w}^{\rm T}\textbf {k}_{\rm O}= \begin {bmatrix} w_1, & \cdots & w_n, & \mu \\ \end {bmatrix} \begin {bmatrix} \gamma (x_1,x^*) \\ \vdots \\ \gamma (x_n,x^*) \\ 1.0 \end {bmatrix} \end{align*}

The standard deviation of the estimate becomes $\sqrt {\sigma ^2_{\rm O}}$. The variance estimate depends entirely on the data configuration and the covariance function, not on the data values themselves.

##### 12.33.4First-Order Second Moment Method (FOSM)

Uncertainty in groundwater may be divided into two classes; intrinsic uncertainty and information uncertainty. The first class derives from the variability of certain natural properties or processes and is an irreducible uncertainty inherent to the system. The second class is the result of ”noisy“ or incomplete information about the system and may be reduced by various strategies, notable further measurements and analyses (DettingerAndWilson1981). The spatial and temporal variation of parameters, such as the recharge rate, and spatial variability of properties such as hydraulic conductivity are extremely complicated. The first idea for the First-Order Second Moment Method (FOSM) was applied to groundwater by (DettingerAndWilson1981). The first moment of the heads is assumed to first-order accuracy by the mean heads obtained as the solution of the model using the optimized values for the model parameters. The variance of the head is denoted as the Second-Order moment. The FOSM method is a approximation method and one of the most widely applied in engineering design. One of the advantages of the method is that it allows the estimation of uncertainty in the output variables without knowing the shapes of pdfs of input variables in detail.

How the uncertainty of the parameters propagates into the head uncertainty is given by the parameter covariance matrix $\textbf {C}^{\rm p}$. Here for we can compute the parameter covariance matrix as the total objective function value $\Phi _{\rm m}$ divided by the degrees of freedom, so:

\begin{align*} \textbf {C}^{\rm p}&=\frac {\Phi _{\rm m}}{N_h-N_p}\left (\textbf {J}^{\rm T}\textbf {Q}\textbf {J}\right )^{-1} \\ \\ \textbf {e}^{\rm p}_i&=\sqrt {C^{\rm p}_{i,i}} \end{align*}

the entries of the diagonal elements of this parameter covariance matrix are the squared standard deviations of the uncertainties of the zonal values; the confident limits (standard error $\textbf {e}^{\rm p}_i$) of the parameters. So, the true parameter value $p_i$ might be in between:

\begin{align*} p_i-e^{\rm p}_i\leq p_i \geq p_i+e^{\rm p}_i \end{align*}

This standard parameter error $\textbf {e}^{\rm p}$ is a measure of how unexplained variability in the data propagates to variability in the parameters, and is essentially an error measure for the parameters. The variance indicates the range over which a parameter value could extend without affecting model fit too adversely.

Moreover, from this parameter covariance matrix the correlation coefficients can be computed as:

\begin{align*} R^{\rm p}_{i,j}=\frac {C^{\rm p}_{i,j}}{\sqrt {C^{\rm p}_{i,i}\cdot C^{\rm p}_{j,j}}} \end{align*}

The correlation matrix shows those parameters that are highly correlated whenever they have correlation factors of $>0.90$ or $<-0.90$. This means that whenever it appears that parameter A would be larger in reality, this also will be the case for parameter B. In other words, one can be linearly predicted from the others with a non-trivial degree of accuracy.

The Jacobian expresses the sensitivity in $\phi$ to variations or uncertainties in the parameters. In this case the Jacobian is computed for all nodes in the model, and not only for the location where measurements are available, as done by the Levenberg-Marquardt Algorithm (LMA), so in this case $\textbf {J}_{\rm fosm}$ is a $N_M \times N_P$ matrix as the Jacobian $\textbf {J}$ for the LMA is $N_H \times N_P$, where $N_M$ is the total number of model nodes.

\begin{align*} {\rm var}(\phi )={\rm diag}(\textbf {J}_{\rm fosm}\textbf {C}^{\rm p}\textbf {J}_{\rm fosm}^{\rm T}) \end{align*}

Finally the standard head error $\sigma ^\phi$ due to the standard parameter error $\sigma ^{\rm p}$ becomes $\sqrt {{\rm var}(\phi )}$. Owing to its simplicity the First-Order Second Moment Method (FOSM) is one of the most widely used technique. Its problem, though, is it linearisation of the model-output function about the mean values of the input variables assuming to represent the statistical properties of the model output over the complete range of input values.

##### 12.33.5Scaling matrix S

For several problems, especially those involving different types of observations and parameters whose magnitudes may differ greatly, the elements of J can be vastly different in magnitude. This can lead to roundoff errors. Fortunately, this can be circumvented to some extent through the use of a scaling matrix S. Let S be a square, $n \times n$ matrix with diagonal elements only, the $i^{\rm th}$ diagonal element of S being given by:

##### 12.33.6Sensitivity

The possibility that a parameter estimation problem runs smoothly decreases with the number of parameters. In highly parameterized problems some parameters are likely to be more sensitive in comparison with others. As a result, the Levenberg-Marquardt algorithm may decide that large adjustments are required for their values if they are to make any significant contribution to reducing the objective function F(p). However, limits are set on parameter changes, such that the magnitude (but not the direction) of the parameter update vector is reduced. If a parameter is particularly insensitive compared to others, it may denominate the parameter update vector, yielding a large update vector. This need to be trimmed to fit the limits of the parameter update and as a result the update for other parameters might not change much at all, with a result that the objective function might be not reduced significantly at all. The result is that the convergence takes place intolerable slowly (or not at all), with a huge wastage of model runs. The relative sensitivity for a parameter is computed by: $s_{i} =\frac {m^{-1} \sum \limits _{j=1}^{m} w_{j} j_{ij}}{\sum s_{i} } \cdot 100\% ,$ where s${}_{i}$ is the sensitivity of the $i{}^{\rm th}$ parameter and is the product of the observational weigh times the Jacobian value for that particular observation $j$ in relation to the parameter I, divided by the total observation $m$. In the figure below an example is given of the relative sensitivity of different parameter during parameter estimation. Figure 12.20: Sensitivity ratio of different parameters during the parameter estimation process.

The most sensitive parameter is the storage (S) in Figure 12.20, however, the parameter adjustment is adjusting the storage the least since the least sensitive parameter, the RIVER, determines the final parameter update vector. As can see in the decrease of the objective function, it is not the best thing to do. Whenever the sensitivity of the Rivers increase, the storage becomes more important in the gradient and the objection function declines more significantly. Figure 12.21: Parameter adjustments in relation to the reduction of the objective function value.

In order to avoid the disturbance, and therefore slower convergence, due to insensitive parameters, iMODFLOW temporary holds those parameter(s). Whether a parameter is insensitive or not is determined by the ratio of their s${}_{i}$ value compared to the total sensitive value, see Figure 4.8.

##### 12.33.7Example

In this section a simple example is given for a single optimalisation step using the Levenberg-Marquardt algorithm as outlines in the previous sections. Suppose two parameters $p_1=p_2=1.0$ need to be optimized that describe two different zones for permeability in a single modellayer. Two measurements are available, one measurement in each zone. The first simulation is to run the model with $p_1=p_2=1.0$, thereafter with $p_1=\Delta p$ and $p_2=1.0$ and finally with $p_1=1.0$ and $p_2=\Delta p$ with $\Delta p=1.1$. For each measurement location the measurements $z_j$ is subtracted from the computed head $\phi ^i_j$ for the $i^{\rm th}$ simulation and stored in the matrix D [m], thus:

\begin{align*} \textbf {D}= \begin {bmatrix} \phi ^0_1-z_1 & \phi ^1_1-z_1 & \phi ^2_1-z_1 \\ \phi ^0_2-z_2 & \phi ^1_2-z_2 & \phi ^2_2-z_2 \\ \end {bmatrix} \Rightarrow \textbf {D}= \begin {bmatrix} -1.484 & -1.391 & -1.406 \\ -1.686 & -1.645 & -1.513 \\ \end {bmatrix} \end{align*}

From D it is obvious that both parameters decline the objective function by a multiplication of $\Delta p$. An adjustment in $p_1$ does reduce the measurement $z_1$ mostly and $p_2$ reduces the second measurement $z_2$. However, both parameters influence both measurements and therefore both parameters cannot be estimated separately as they are correlated to eachother. This is more clear in the Jacobian J [m/$\Delta p$], defined as:

\begin{align*} \textbf {J}= \begin {bmatrix} (d^2_1-d^1_1)/\Delta p_1 & (d^3_1-d^1_1)/\Delta p_1 \\ (d^2_2-d^1_2)/\Delta p_2 & (d^3_1-d^1_1)/\Delta p_2 \\ \end {bmatrix} \Rightarrow \textbf {J}= \begin {bmatrix} 0.976 & 0.818 \\ 0.430 & 1.815 \\ \end {bmatrix} \end{align*}

where $d^i_j$ represent the $i^{\rm th}$ column and $j^{\rm th}$ row from the D matrix. As it is common to apply a $\log$ transformation whenever permeabilities are calibrated, the value for $\Delta p=\log (1.1)=0.0953$. Based on the Jacobian, the Hessian H [$\Delta p^2/m^2$] is computed as:

\begin{align*} \textbf {H}=(\textbf {J}^{\rm T}\textbf {QJ}+\psi \textbf {I})^{-1}= \begin {bmatrix} 1.137 & 1.585 \\ 1.585 & 3.977 \\ \end {bmatrix}^{-1} = \begin {bmatrix} 1.979 & -0.789 \\ -0.789 & 0.566 \\ \end {bmatrix} \end{align*}

where Q is an identity matrix with diagonal values of $1.0$, so every measurement has equal weights. The Hessian is proportional to the curvature of the plane of the objective function $\Phi$, it implies a large step in the direction with low curvature (i.e., an almost flat terrain) and a small step in the direction with high curvature (i.e, a steep incline). From the Hessian the parameter covariance $\textbf {C}^{\rm p}$ [$\Delta p^2$] is estimated as:

\begin{align*} \textbf {C}^{\rm p}=\frac {\Phi _m}{{\rm max}(1.0,N_h-N_p)}\textbf {H}= \frac {4.266}{1} \begin {bmatrix} 1.979 & -0.789 \\ -0.789 & 0.566 \\ \end {bmatrix} = \begin {bmatrix} 8.442 & -3.365 \\ -3.365 & 2.414 \\ \end {bmatrix} \end{align*}

where the total objective function $\Phi =4.266$ m$^2$ and the number of measurements $N_h=2$ and the number of parameter $N_p=2$. From these parameter covariance matrix the parameter standard error $\textbf {e}^{\rm p}$ [$\Delta p$] can be computed as:

\begin{align*} \textbf {e}^{\rm p}=\sqrt {{\rm diag}(\textbf {C}^{\rm p})}= \begin {bmatrix} 2.905 \\ 1.554 \\ \end {bmatrix} \end{align*}

In other words, the twice the parameter standard error represents $95\%$ of the confident limits of the parameter. In this case the current parameter value is highly uncertain as it $\Delta p_1 \pm 2.905$ and $\Delta p_2 \pm 1.554$. Since the parameter is log transformed, the confident limits of the parameters are currently:

\begin{align*} {\rm exp}(\Delta p_1=0.0 \pm 2.905) & \longrightarrow 0.0033 > \Delta p_1 < 297.32 \\ {\rm exp}(\Delta p_2=0.0 \pm 1.554) & \longrightarrow 0.0476 > \Delta p_2 < 21.02 \\ \end{align*}

As the optimization of the parameters continues, this parameter standard error should decline as the objective function value $\Phi$ declines and the Jacobian will be become more flattened in the minimum of the plane of $\Phi$. As a result the confident limits decreases as well. Another useful parameter is the correlation matrix $\textbf {R}^{\rm p}$ that can be computed from the covariance matrix $\textbf {C}^{\rm p}$ easily.

\begin{align*} \textbf {R}^{\rm p}= \begin {bmatrix} 1.0 & -0.745 \\ -0.745 & 1.0 \\ \end {bmatrix} \end{align*}

From this correlation matrix is can be observed that both parameters are not correlated significantly and can be determined in combination. Another method of determining the uniqueness of the covariance matrix is the computation of its eigenvalues and eigenvectors, so:

\begin{align*} (\textbf {J}^{\rm T}\textbf {QJ})\textbf {v}=\Lambda \textbf {v}= \begin {bmatrix} 1.137 & 1.585 \\ 1.585 & 3.977 \\ \end {bmatrix} \begin {bmatrix} 0.408 & 0.913 \\ 0.913 & -0.408 \\ \end {bmatrix} = \begin {bmatrix} 91.611 \\ 8.389 \\ \end {bmatrix} \begin {bmatrix} 1.137 & 1.585 \\ 1.585 & 3.977 \\ \end {bmatrix} \end{align*}

where $\Lambda$ contains the ordered eigenvalues (singular values are $\Lambda ^\frac {1}{2}$) and $\textbf {v}$ are the eigenvectors. The eigenvectors are representing the axis of the residual surface, the eigenvalues $\sqrt {\lambda _i}$ represent the length of axis $i$. The ratio between the first and last eigenvalues is the condition number $\kappa$, thus $\kappa =91.611/8.389=10.920$. This condition number $\kappa$ is relatively small and therefore the problem is well-conditioned; the parameters are identifiable, they all contribute significantly and unique to the residual surface.

The Gradient Descent gradient $\Delta \textbf {p}^*$ of the parameter adjustment becomes:

\begin{align*} \Delta \textbf {p}^* =-\textbf {J}^{\rm T}\textbf {Qd}_1= -\begin {bmatrix} 0.976 & 0.818 \\ 0.430 & 1.815 \\ \end {bmatrix} \begin {bmatrix} 1.0 & 1.0 \\ 0.0 & 0.0 \\ \end {bmatrix} \begin {bmatrix} -1.484 \\ -1.686 \\ \end {bmatrix} = \begin {bmatrix} 2.177 \\ 4.280 \\ \end {bmatrix} \end{align*}

The parameter update becomes the ${\rm exp}()$ of the computed Gradient Descent gradient, so $\Delta p_1=8.82$ and $\Delta p_2=72.24$, which is rather huge that can be reduced within the limits of a trust hypersphere. Whenever the Hessian H is included (at this stage the $\psi$ is assumed to be 0.0 meaning that the full Gauss-Newton algorithm is applied, this $\psi$ normally increases during the optimisation such that in the final stage the parameter update will be more equal to the Gradient Descent method) the final parameter update vector $\Delta \textbf {p}$ becomes:

\begin{align*} \Delta \textbf {p} = -\textbf {HJ}^{\rm T}\textbf {Qd}_1 = -\begin {bmatrix} 1.979 & -0.789 \\ -0.789 & -0.566 \\ \end {bmatrix} \begin {bmatrix} -2.177 \\ -4.280 \\ \end {bmatrix} = \begin {bmatrix} 0.931 \\ 0.705 \\ \end {bmatrix} \end{align*}

The elements for the final parameter update $\Delta \textbf {p}$ becomes $\Delta p_1={\rm exp}(0.0+0.931)=2.537$ and $\Delta p_2={\rm exp}(0.0+0.705)=2.024$. After this the sequence starts again.

##### 12.33.8Remarks

Notes:

• • Do not use parameter adjustments that are too large, whenever many parameters are concerned use a step size of 2, whenever you have less parameters you can increase this to a maximum of 10;

• • Experiment with different starting values for a parameter to see whether you end up with the same optimal values;