The Parallel Krylov Solver (PKS) for speeding-up computations is a new package that is being developed in cooperation with the USGS, Utrecht University, Alterra and Delft University of Technology ([Verkaik et al.(2016)]). PKS is basically a linear solver that is largely based on the PCGU-solver in MODFLOW-USG for unstructured grids ([Panday et al.(2003)]) and is used in iMODFLOW for structured grids. PKS is suitable for coarse-grained problems, where the computational time for each subdomain is much larger than the actual communication overhead.

Subject for PKS is solving the linear system of equations, resulting from the finite-volume discretization of the groundwater flow equation ([Harbaugh(2005)]), to be denoted by:

\( \seteqsection {12} \)

\( \seteqnumber {8} \)

\begin{equation} \label {eq:lineareq} A\mathbf {h} = \mathbf {b}, \end{equation}

where \(\mathbf {A}\) is a square, symmetric, positive-definite coefficient matrix [L\(^{2}\)/T] that includes cell-by-cell conductances, calculated using cell dimensions and hydraulic conductivities at cell interfaces, and unknown components of sink/source and storage terms; \(\mathbf {h}\) is the vector of unknown heads at time \(t\); and \(\mathbf {b}\) is a known vector of source/sink and storage terms.

It can be shown mathematically that convergence of iterative methods is strongly related to the spectral properties (eigenvalues) of the coefficient matrix. Instead of solving the system 12.8, it is more efficient to solve the (left-)preconditioned equivalent

\( \seteqsection {12} \)

\( \seteqnumber {9} \)

\begin{equation} \label {eq:lineareqprec} M^{-1}A\mathbf {h} = M^{-1}\mathbf {b}, \end{equation}

where \(M\) is called the preconditioner. Choosing \(M\) is usually done such that \(M\) is a good approximation of \(A\), the eigenvalues of \(M^{-1}A\) are clustered around 1, and the system involving \(M\) is much easier to solve than the original system.

The PKS solver is a so-called Schwarz domain decomposition solver ([Dolean et al.(2015)]) and uses a parallel Additive Schwarz (AS) preconditioner^{1}:

\( \seteqsection {12} \)

\( \seteqnumber {10} \)

\begin{equation} \label {eq:blockjacobi} M^{-1}_{jac} = \sum _{i}{A^{-1}_{i}}, \end{equation}

where \(A_{i}\) corresponds to the (local) interior coefficients of the (global) coefficient matrix \(A\) for sub-domain \(i\). Basically, this corresponds to the block-Jacobi (block-diagonal) preconditioner, after numbering the unknowns over the sub-domains.

PKS solves the preconditioned sytem 12.9 with the AS-preconditioner 12.10 using the Conjugate Gradient (CG) Krylov subspace method, resulting in the so-called CG-Schwarz method. Within CG, application of \(M^{-1}_{jac}\) can be fully done in parallel. The corresponding sub-domain solutions are obtained inaccurately by ILU(0) (zero fill-in incomplete LU-factorization) preconditioning of \(A_{i}\) only. PKS only supports Dirichlet interface transmission conditions and does not (yet) support coarse-space correction like deflation.

Inter-subdomain communication PKS follows a so-called distributed memory parallellization approach, where each subdomain is uniquely assigned to one computational core (process), using local memory (RAM) only. In this way PKS is scalable regarding problem size and hardware. Exchanging data between subdomains is done by the Message Passing Interface, and typically involves communication for each CG inner (Schwarz) iteration, ensuring a tight coupling: local
(point-to-point) communication for the vector update and global (all-reduce) communications for computing interior products and evaluating stopping criteria. Since this is done for each inner iteration the expected speed-up in computational time with PKS largely depends on MPI (latency and bandwidth).

Typically, PKS is suitable through MPI on fast networks, like Infiniband/Myrinet on Linux clusters, or MPI through memory such as multi-core CPUs on desktop/laptop machines or shared-memory machines like SGI-clusters. For the current generation of multi-core CPUs, e.g. Intel Haswell E5-2698v3 16-core CPU, one should keep in mind that for large, memory-driven, groundwater models the actual bandwidth between processor and memory limits the parallel performance. In practice, this means it may be sometimes more beneficial to use less than the maximum number of cores.

**Note: **On desktops/laptops with multi-core CPUs it may be beneficial not to use all cores.

Load balance Besides MPI communication, load balance is also very important for parallel performance. This means that the actual work/load should be distributed as equally as possible over the multiple computational cores. PKS now supports two methods: uniform in x,y-direction, and the Recursive Coordinate Bisection (RCB) method. The RCB method, recursively, weight the user-defined load, by alternating in horizontal and vertical direction. Figure 12.18 shows an example of both methods for the Netherlands Hydrological Model ([De Lange et al. (2014)]) and 128 subdomains.

For this example the uniform partitioning method (left) results in some subdomains containing zero cells, since in this case the model boundary (IBOUND) is irregular, while the RCB method (right) results in a more optimal partitioning. Although RCB results in optimal load, it does not result in an optimal communication scheme regarding MPI. One should notice that even with the RCB and irregular boundaries, finding the optimal weight can be hard and subject for trial-and-error. For each MODFLOW cell, the weight can varying due to for example boundary conditions (stresses) and coupling concepts.

Input/output Another parallel performance aspect that is important is input/output (I/O). PKS supports parallel binary (clipping) reading of model IDF input files and parallel writing of IDF output files, that can be merged by the master MPI process afterwards. However, the actual I/O performance depends on the given hardware (e.g. type of hard drive) and file system configuration. When too many MPI processes simultaneously try to access one hard drive it can slow down I/O. This
is typically the case for common off-the-shelf desktops or laptops. On the other hand, super computers use parallel file-systems - e.g. Lustre file systems - to tackle this problem by accessing multiple hard drives simultaneously.

**Note: **In general, it is always recommended to try to minimize model output.

^{1} To be more precise, an overlapping Restricted Additive Schwarz preconditioner is implemented, but for iMOD, only a fixed overlap of 1 is being used, corresponding to the non-overlapping case.