Overland Flow

In this post, we'll look at overland flow -- how rainwater drains across a landscape. The equations of motion are pretty rowdy and have some fascinating effects. To derive them, we'll start from the shallow water or Saint Venant equations for the water layer thickness $h$ and velocity $u$:

$$\begin{align} \frac{\partial h}{\partial t} + \nabla\cdot hu & = \dot a \\ \frac{\partial}{\partial t}hu + \nabla\cdot hu\otimes u & = -gh\nabla (b + h) - k|u|u \end{align}$$

The final term in the momentum equation represents frictional dissipation and $k$ is a (dimensionless) friction coefficient. Using the shallow water equations for predicting overland flow is challenging because the thickness can go to zero.

For many thin open channel flows, however, the fluid velocity can be expressed purely in terms of the surface slope and other factors. You could arrive at one such simplification by assuming that the inertial terms in the momentum equation are zero:

$$k|u|u + gh\nabla(b + h) = 0.$$

This approximation is known as the Darcy-Weisbach equation. We'll use it in the following because it's simple and it illustrates all the major difficulties. For serious work, you'd probably want to use the Manning formula, as it has some theoretical justification for turbulent open channel flows. The overall form of the equation and the resulting numerical challenges are the same in each case.

Putting together the Darcy-Weisbach equation for the velocity with the mass conservation equation gives a single PDE for the water layer thickness:

$$\frac{\partial h}{\partial t} - \nabla\cdot\left(\sqrt{\frac{gh^3}{k}}\frac{\nabla(b + h)}{\sqrt{|\nabla(b + h)|}}\right) = \dot a.$$

This looks like a parabolic equation, but there's a catch! The diffusion coefficient is proportional to $h^{3/2}$, so it can go to zero when $h = 0$; all the theory for elliptic and parabolic equations assumes that the diffusion coefficient is bounded below. For a non-degenerate parabolic PDE, disturbances propagate with infinite speed. For the degenerate problem we're considering, that's no longer true -- the $h = 0$ contour travels with finite speed! While we're using the Darcy-Weisbach equation to set the velocity here, we still get finite propagation speed if we use the Manning equation instead. What's important is that the velocity is propertional to some power of the thickness and surface slope.

Eliminating the velocity entirely from the problem is convenient for analysis, but not necessarily the best way to go numerically. We'll retain the velocity as an unknown, which gives the resulting variational form much of the same character as the mixed discretization of the heat equation.

As our model problem, we'll use the dam break test case from Santillana and Dawson (2009). They discretized the overland flow equations using the local discontinuous Galerkin or LDG method, which extends DG for hyperbolic systems to mixed advection-diffusion problems. We'll use different numerics because Firedrake has all the hipster elements. I'm eyeballing the shape of the domain from their figures.

import numpy as np
import pygmsh

geometry = pygmsh.built_in.Geometry()

coords = np.array(
    [
        [0.0, 0.0],
        [3.0, 0.0],
        [3.0, 2.0],
        [2.0, 2.0],
        [2.0, 4.0],
        [3.0, 4.0],
        [3.0, 6.0],
        [0.0, 6.0],
        [0.0, 4.0],
        [1.0, 4.0],
        [1.0, 2.0],
        [0.0, 2.0],
    ]
)

points = [
    geometry.add_point(x, lcar=0.125) for x in
    np.column_stack((coords, np.zeros(len(coords))))
]
edges = [
    geometry.add_line(p1, p2) for p1, p2 in
    zip(points, points[1:] + [points[0]])
]

geometry.add_physical(edges)
loop = geometry.add_line_loop(edges)
plane_surface = geometry.add_plane_surface(loop)
geometry.add_physical(plane_surface)

with open("dam.geo", "w") as geo_file:
    geo_file.write(geometry.get_code())
    
!gmsh -2 -format msh2 -v 0 -o dam.msh dam.geo
import firedrake

mesh = firedrake.Mesh("dam.msh")
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_aspect("equal")
firedrake.triplot(mesh, axes=ax);

The bed profile consists of an upper, elevated basin, together with a ramp down to a lower basin.

from firedrake import Constant, min_value, max_value

x = firedrake.SpatialCoordinate(mesh)

y_0 = Constant(2.0)
y_1 = Constant(4.0)
b_0 = Constant(0.0)
b_1 = Constant(1.0)
b_expr = b_0 + (b_1 - b_0) * max_value(0, min_value(1, (x[1] - y_0) / (y_1 - y_0)))

S = firedrake.FunctionSpace(mesh, "CG", 1)
b = firedrake.interpolate(b_expr, S)
fig = plt.figure()
axes = fig.add_subplot(projection="3d")
axes.set_box_aspect((3.0, 6.0, 1.0))
axes.set_axis_off()
firedrake.trisurf(b, axes=axes);

As I alluded to before, rather than eliminate the velocity entirely, we'll keep it as a field to be solved for explicitly. The problem we're solving, while degenerate, is pretty similar to the mixed form of the heat equation. This suggests that we should use element pairs that are stable for mixed Poisson. Here I'm using the MINI element: continuous linear basis functions for the thickness, and continuous linear enriched with cubic bubbles for the velocity. We could also have used a more proper $H(\text{div})$-conforming pair, like discontinuous Galerkin for the thickness and Raviart-Thomas or Brezzi-Douglas-Marini for the velocity.

cg1 = firedrake.FiniteElement("CG", "triangle", 1)
Q = firedrake.FunctionSpace(mesh, cg1)
b3 = firedrake.FiniteElement("B", "triangle", 3)
V = firedrake.VectorFunctionSpace(mesh, cg1 + b3)
Z = Q * V

The dam break problem specifies that the thickness is equal to 1 in the upper basin and 0 elsewhere. I've done a bit of extra work below because the expression for $h$ is discontinuous, and interpolating it directly gives some obvious mesh artifacts. Instead, I've chosen to project the expression and clamp it above and below.

h_expr = firedrake.conditional(x[1] >= y_1, 1.0, 0.0)
h_0 = firedrake.project(h_expr, Q)
h_0.interpolate(min_value(1, max_value(0, h_0)));
fig, ax = plt.subplots()
ax.set_aspect("equal")
colors = firedrake.tripcolor(h_0, axes=ax)
fig.colorbar(colors);
z = firedrake.Function(Z)
z_n = firedrake.Function(Z)
δt = Constant(1.0 / 32)
z_n.sub(0).assign(h_0)
z.sub(0).assign(h_0);

The test case in the Santillana and Dawson paper uses a variable friction coefficient in order to simulate the effect of increased drag when flowing over vegetation.

from firedrake import inner

k_0 = Constant(1.0)
δk = Constant(4.0)
r = Constant(0.5)
x_1 = Constant((1.5, 1.0))
x_2 = Constant((1.0, 3.5))
x_3 = Constant((2.0, 2.5))
ψ = sum(
    [
        max_value(0, 1 - inner(x - x_i, x - x_i) / r**2)
        for x_i in [x_1, x_2, x_3]
    ]
)
k = k_0 + δk * ψ
fig, axes = plt.subplots()
axes.set_aspect("equal")
firedrake.tripcolor(firedrake.interpolate(k, S), axes=axes);

The code below defines the variational form of the overland flow equations.

from firedrake import div, grad, dx

g = Constant(1.0)

h, q = firedrake.split(z)
h_n = firedrake.split(z_n)[0]
ϕ, v = firedrake.TestFunctions(Z)

F_h = ((h - h_n) / δt + div(q)) * ϕ * dx
friction = k * inner(q, q)**0.5 * q
gravity = -g * h**3 * grad(b + h)
F_q = inner(friction - gravity, v) * dx
F = F_h + F_q

We'll run into trouble if we try and use a Newton-type method on the true variational form. Notice how the $q$-$q$ block of the derivative will go to zero whenever $q = 0$. This will happen whenever the thickness is zero too. The usual hack is to put a fudge factor $\varepsilon$ into the variational form, as shown below.

ϵ = Constant(1e-3)
friction = k * (inner(q, q) + ϵ**2)**0.5 * q
gravity = -g * h**3 * grad(b + h)
F_qϵ = inner(friction - gravity, v) * dx
F_ϵ = F_h + F_qϵ

The disadvantage of is that we're then solving a slightly different physics problem. We don't have a great idea ahead of time of what $\varepsilon$ should be either. If we choose it too large, the deviation from the true problem is large enough that we can't believe the results. But if we choose it too small, the derivative will fail to be invertible.

We can take a middle course by instead just using the perturbed variational form just to define the derivative in Newton's method, but keep the true variational form as the quantity to find a root for. To do this, we'll pass the derivative of $F_\varepsilon$ as the Jacobian or J argument to the nonlinear variational problem object. Choosing $\varepsilon$ too small will still cause the solver to crash. Taking it to be too large, instead of causing us to solve a completely different problem, will now only make the solver go slower instead. We still want to make $\varepsilon$ as small as possible, but to my mind, getting the right answer slowly is a more tolerable failure mode than getting the wrong answer.

bcs = firedrake.DirichletBC(Z.sub(1), firedrake.zero(), "on_boundary")
J = firedrake.derivative(F_ϵ, z)
problem = firedrake.NonlinearVariationalProblem(F, z, bcs, J=J)

We'll have one final difficulty to overcome -- what happens if the thickness inadvertently becomes negative? There's a blunt solution that everyone uses, which is to clamp the thickness to 0 from below at every step. Clamping can work in many cases. But if you're using a Runge-Kutta method, it only assures positivity at the end of each step and not in any of the intermediate stages. We can instead formulate the whole problem, including the non-negativity constraint, as a variational inequality. Much like how some but not all variational problems arise from minimization principles, some variational inequalities arise from minimization principles with inequality constraints, like the obstacle problem. But variational inequalities are a more general class of problem than inequality-constrained minimization. Formulating overland flow as as a variational inequality is a bit of overkill for the time discretization that we're using. Nonetheless, I'll show how to do that in the following just for illustrative purposes. We first need two functions representing the upper and lower bounds for the solution. In this case, the upper bound is infinity.

from firedrake.petsc import PETSc

upper = firedrake.Function(Z)
with upper.dat.vec as upper_vec:
    upper_vec.set(PETSc.INFINITY)

The thickness is bounded below by 0, but there's no lower bound at all on the flux, so we'll set only the flux entries to negative infinity.

lower = firedrake.Function(Z)
with lower.sub(1).dat.vec as lower_vec:
    lower_vec.set(PETSc.NINFINITY)

When we want to solve variational inequalities, we can't use the usual Newton solvers in PETSc -- we have a choice between a semi-smooth Newton (vinewtonssls) and an active set solver (vinewtonrsls). I couldn't get the semi-smooth Newton solver to work and I have no idea why.

params = {
    "solver_parameters": {
        "snes_type": "vinewtonrsls",
        "ksp_type": "gmres",
        "pc_type": "lu",
    }
}

solver = firedrake.NonlinearVariationalSolver(problem, **params)

Finally, we'll run the timestepping loop. Here we pass the bounds explicitly on each call to solve.

import tqdm

final_time = 60.0
num_steps = int(final_time / float(δt))

hs = [z.sub(0).copy(deepcopy=True)]
qs = [z.sub(1).copy(deepcopy=True)]

for step in tqdm.trange(num_steps):
    solver.solve(bounds=(lower, upper))
    z_n.assign(z)

    h, q = z.split()
    hs.append(h.copy(deepcopy=True))
    qs.append(q.copy(deepcopy=True))
100%|██████████| 1920/1920 [08:22<00:00,  3.82it/s]

Movie time as always.

%%capture

from matplotlib.animation import FuncAnimation

fig, axes = plt.subplots()
axes.set_aspect("equal")
axes.get_xaxis().set_visible(False)
axes.get_yaxis().set_visible(False)

colors = firedrake.tripcolor(
    hs[0], axes=axes, vmin=0, vmax=1.0, cmap="Blues", num_sample_points=4
)
fn_plotter = firedrake.FunctionPlotter(mesh, num_sample_points=4)

def animate(h):
    colors.set_array(fn_plotter(h))

interval = 1e3 / 60
animation = FuncAnimation(fig, animate, frames=hs, interval=interval)
from IPython.display import HTML

HTML(animation.to_html5_video())

As some a posteriori sanity checking, we can evaluate how much the total water volume deviates.

volumes = np.array([firedrake.assemble(h * dx) for h in hs])
volume_error = (volumes.max() - volumes.min()) / volumes.mean()
print(f"Volume relative error: {volume_error:5.2g}")
Volume relative error: 0.013

Where a truly conservative scheme would exactly preserve the volume up to some small multiple of machine precision, we can only get global conservation up to the mesh resolution with our scheme. Instead, there are spurious "sources" at the free boundary. Likewise, there can be spurious sinks in the presence of ablation, so the sign error can go either way. This topic is covered in depth in this paper.

fig, axes = plt.subplots()
ts = np.linspace(0.0, final_time, num_steps + 1)
axes.set_xlabel("time")
axes.set_ylabel("volume ($m^3$)")
axes.plot(ts, volumes);

We can examine the fluxes after the fact in order to see where the value of $\varepsilon$ that we picked sits.

qms = [firedrake.project(inner(q, q)**0.5, Q) for q in qs]
area = firedrake.assemble(Constant(1) * dx(mesh))
qavgs = np.array([firedrake.assemble(q * dx) / area for q in qms])
print(f"Average flux: {qavgs.mean()*100**2:5.1f} cm²/s")
print(f"Fudge flux:   {float(ϵ)*100**2:5.1f} cm²/s")
Average flux: 266.5 cm²/s
Fudge flux:    10.0 cm²/s

The fudge flux is 1/25 that of the average. This is quite a bit smaller, but not so much so that we should feel comfortable with this large a perturbation to the physics equations themselves. The ability to use it only in the derivative and not in the residual is a huge help.

To wrap things up, the overland flow equations are a perfect demonstration of how trivially equivalent forms of the same physical problem can yield vastly different discretizations. Writing the system as a single parabolic PDE might seem simplest, but there are several potential zeros in the denominator that require some form of regularization. By contrast, using a mixed form introduces more unknowns and a nonlinear equation for the flux, but there's wiggle room within that nonlinear equation. This makes it much easier to come up with a robust solution procedure, even if it includes a few uncomfortable hacks like using a different Jacobian from that of the true problem. Finally, while our discretization still works ok with no positivity constraint, PETSc has variational inequality solvers that make it possible to enforce positivity.

Billiards on surfaces

In the previous post, I showed how to integrate Hamiltonian systems

$$\begin{align} \dot q & = +\frac{\partial H}{\partial p} \\ \dot p & = -\frac{\partial H}{\partial q} \end{align}$$

using methods that approximately preserve the energy. Here I'd like to look at what happens when there are non-trivial constraints

$$g(q) = 0$$

on the configuration of the system. The simplest example is the pendulum problem, where the position $x$ of the pendulum is constrained to lie on the circle of radius $L$ centered at the origin. These constraints are easy to eliminate by instead working with the angle $\theta$. A more complicated example is a problem with rotational degrees of freedom, where the angular configuration $Q$ is a 3 $\times$ 3 matrix. The constraint comes from the fact that this matrix has to be orthogonal:

$$Q^\top Q = I.$$

We could play similar tricks to the case of the pendulum and use Euler angles, but these introduce other problems when used for numerics. For this or other more complex problems, we'll instead enforce the constraints using a Lagrange multiplier $\lambda$, and working with the constrained Hamiltonian

$$H' = H - \lambda\cdot g(q).$$

We're then left with a differential-algebraic equation:

$$\begin{align} \dot q & = +\frac{\partial H}{\partial p} \\ \dot p & = -\frac{\partial H}{\partial q} + \lambda\cdot\nabla g \\ 0 & = g(q). \end{align}$$

If you feel like I pulled this multiplier trick out of a hat, you might find it more illuminating to think back to the Lagrangian formulation of mechanics, which corresponds more directly with optimization via the stationary action principle. Alternatively, you can view the Hamiltonian above as the limit of

$$H_\epsilon' = H + \frac{|p_\lambda|^2}{2\epsilon} - \lambda\cdot g(q)$$

as $\epsilon \to 0$, where $p_\lambda$ is a momentum variable conjugate to $\lambda$. This zero-mass limit is a singular perturbation, so actually building a practical algorithm from this formulation is pretty awful, but it can be pretty helpful conceptually.

For now we'll assume that the Hamiltonian has the form

$$H = \frac{1}{2}p^*M^{-1}p + U(q)$$

for some mass matrix $M$ and potential energy $U$. The 2nd-order splitting scheme to solve Hamilton's equations of motion in the absence of any constraints are

$$\begin{align} p_{n + \frac{1}{2}} & = p_n - \frac{\delta t}{2}\nabla U(q_n) \\ q_{n + 1} & = q_n + \delta t\cdot M^{-1}p_{n + \frac{1}{2}} \\ p_{n + 1} & = p_{n + \frac{1}{2}} - \frac{\delta t}{2}\nabla U(q_{n + 1}). \end{align}$$

To enforce the constraints, we'll add some extra steps where we project back onto the surface or, in the case of the momenta, onto its cotangent space. In the first stage, we solve the system

$$\begin{align} p_{n + \frac{1}{2}} & = p_n - \frac{\delta t}{2}\left\{\nabla U(q_n) - \lambda_{n + 1}\cdot \nabla g(q_n)\right\} \\ q_{n + 1} & = q_n - \delta t\cdot M^{-1}p_{n + \frac{1}{2}} \\ 0 & = g(q_{n + 1}). \end{align}$$

If we substitute the formula for $p_{n + 1/2}$ into the second equation and then substitute the resulting formula for $q_{n + 1}$ into the constraint $0 = g(q_{n + 1})$, we get a nonlinear system of equations for the new Lagrange multiplier $\lambda_{n + 1}$ purely in terms of the current positions and momenta. Having solved this nonlinear system, we can then substitute the value of $\lambda_{n + 1}$ to obtain the values of $p_{n + 1/2}$ and $q_{n + 1}$. Next, we compute the momentum at step $n + 1$, but subject to the constraint that it has to lie in the cotangent space of the surface:

$$\begin{align} p_{n + 1} & = p_{n + \frac{1}{2}} - \frac{\delta t}{2}\left\{\nabla U(q_{n + 1}) - \mu_{n + 1}\cdot \nabla g(q_{n + 1})\right\} \\ 0 & = \nabla g(q_{n + 1})\cdot M^{-1}p_{n + 1}. \end{align}$$

Once again, we can substitute the first equation into the second to obtain a linear system for the momentum-space multiplier $\mu$. Having solved for $\mu$, we can then back-substitute into the first equation to get $p_{n + 1}$. This is the RATTLE algorithm. (I'm pulling heavily from chapter 7 of Leimkuhler and Reich here if you want to see a comparison with other methods and proofs that it's symplectic.)

Surfaces

Next we have to pick an example problem to work on. To start out, we'll assume that the potential energy for the problem is 0 and focus solely on the free motion of a particle on some interesting surface. The simplest surface we could look at is the sphere:

$$g(x, y, z) = x^2 + y^2 + z^2 - R^2$$

or the torus:

$$g(x, y, z) = \left(\sqrt{x^2 + y^2} - R\right)^2 + z^2 - r^2.$$

Just for kicks, I'd like to instead look at motion on surfaces of genus 2 or higher. There are simple parametric equations for tracing out spheres and tori in terms of the trigonometric functions, so the machinery of explicitly enforcing constraints isn't really necessary. There is no such direct parameterization for higher-genus surfaces, so we'll actually need to be clever in defining the surface and in simulating motion on it. As an added bonus, the ability to trace out curves on the surface will give us a nice way of visualizing it.

To come up with an implicit equation for a higher-genus surface, we'll start with an implicit equation for a 2D curve and inflate it into 3D. For example, the equation for the torus that we defined above is obtained by inflating the implicit equation $\sqrt{x^2 + y^2} - R = 0$ for the circle in 2D. What we want to generate higher-genus surfaces is a lemniscate. An ellipse is defined as the set of points such that the sum of the distances to two foci is constant. Likewise, a lemniscate is defined as the set of points such that the product of the distances to two or more foci is constant. The Bernoulli leminscate is one such example, which traces out a figure-8 in 2D. The Bernoulli leminscate is the zero set of the polynomial

$$f(x, y) = (x^2 + y^2)^2 - a^2(x^2 - y^2)$$

and it also has the parametric equation

$$\begin{align} x & = a\frac{\sin t}{1 + \cos^2t} \\ y & = a\frac{\sin t\cdot\cos t}{1 + \cos^2t} \end{align}$$

which gives us a simple way to visualize what we're starting with.

import numpy as np
from numpy import pi as π

a = 1
t = np.linspace(0, 2 * π, 256)
xs = a * np.sin(t) / (1 + np.cos(t) ** 2)
ys = a * np.sin(t) * np.cos(t) / (1 + np.cos(t) ** 2)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.set_aspect("equal")
ax.plot(xs, ys);

We've loosely referred to the idea of inflating the zero-contour of a function $f(x, y)$ into 3D. The 3D function defining the desired implicit surface is

$$g(x, y, z) = f(x, y)^2 + z^2 - r^2,$$

where $r$ is a free parameter that we'll have to tune. I'm going to guess that $r < \frac{a}{\sqrt{2}}$ but it could be much less; beyond that we'll have to figure out what $r$ is by trial and error.

The code below uses the sympy software package to create a symbolic representation of the function $g$ defining our surface. Having a symbolic expression for $g$ allows us to evaluate it and its derivatives, but to actually visualize the surface we'll have to sample points on it somehow.

import sympy
x, y, z = sympy.symbols("x y z")
f = (x ** 2 + y ** 2) ** 2 - a ** 2 * (x ** 2 - y ** 2)

r = a / 6
g = f ** 2 + z ** 2 - r **2
dg = sympy.derive_by_array(g, [x, y, z])

Symbolically evaluating $g$ every time is expensive, so the code below uses the lambdify function from sympy to convert our symbolic expression into an ordinary Python function. I've added some additional wrappers so that we can pass in a numpy array of coordinates rather than the $x$, $y$, and $z$ values as separate arguments.

g_fn = sympy.lambdify([x, y, z], g, modules="numpy")
def G(q):
    return np.array([g_fn(*q)])

dg_fn = sympy.lambdify([x, y, z], dg, modules="numpy")
def dG(q):
    return np.array(dg_fn(*q)).reshape([1, 3])

One of the first algorithms for constrained mechanical systems was called SHAKE, so naturally some clever bastard had to make one called RATTLE and there's probably a ROLL out there too. The code below implements the RATTLE algorithm. You can view this as analogous to the Stormer-Verlet method, which does a half-step of the momentum solve, a full step of the position solve, and finally another half-step of the momentum solve again. In the RATTLE algorithm, we have to exercise a bit of foresight in the initial momentum half-step and position full-step in order to calculate a Lagrange multiplier to project an arbitrary position back onto the zero-contour of $g$. Solving for the position multiplier is a true nonlinear equation, whereas the final momentum half-step is just a linear equation for the momentum and its multiplier, which we've written here as $\mu$. Here we only have one constraint, so each multiplier is a scalar, which is a convenient simplification.

import tqdm
import scipy.linalg
import scipy.optimize

def trajectory(q, v, dt, num_steps, f, g, dg, progressbar=False):
    qs = np.zeros((num_steps + 1,) + q.shape)
    vs = np.zeros((num_steps + 1,) + q.shape)

    g_0 = g(q)
    λs = np.zeros((num_steps + 1,) + g_0.shape)
    μs = np.zeros((num_steps + 1,) + g_0.shape)

    def project_position(λ_0, q, v):
        def fn(λ, q, v):
            v_n = v + 0.5 * dt * (f(q) - λ @ dg(q))
            q_n = q + dt * v_n
            return g(q_n)

        result = scipy.optimize.root(fn, λ_0, args=(q, v))
        return result.x

    def project_velocity(q, v):
        J = dg(q)
        # TODO: Don't solve the normal equations, you're making Anne G sad
        A = J @ J.T
        b = J @ v
        return scipy.linalg.solve(A, b, assume_a="pos")

    qs[0] = q
    μs[0] = project_velocity(q, v)
    vs[0] = v - μs[0] @ dg(q)

    iterator = (tqdm.trange if progressbar else range)(num_steps)
    for t in iterator:
        λs[t + 1] = project_position(λs[t], qs[t], vs[t])
        v_mid = vs[t] + 0.5 * dt * (f(qs[t]) - λs[t + 1] @ dg(qs[t]))
        qs[t + 1] = qs[t] + dt * v_mid

        v_prop = v_mid + 0.5 * dt * f(qs[t + 1])
        μs[t + 1] = project_velocity(qs[t + 1], v_prop)
        vs[t + 1] = v_mid + 0.5 * dt * f(qs[t + 1]) - μs[t + 1] @ dg(qs[t + 1])

    return qs, vs, λs, μs

I'll add that this algorithm was exceedingly fiddly to implement and I had to debug about 5 or 6 times before I got it right. The sanity checking shown below was essential to making sure it was right.

def potential(q):
    return q[2]

def force(q):
    return np.array((0, 0, -1))
num_trajectories = 25
θs = 2 * π * np.linspace(0, 1, num_trajectories)
num_steps = 2000
Qs = np.zeros((num_steps + 1, 3 * num_trajectories))
Vs = np.zeros((num_steps + 1, 3 * num_trajectories))
Λs = np.zeros((num_steps + 1, num_trajectories))
for i, θ in tqdm.tqdm(enumerate(θs), total=num_trajectories):
    q = np.array((0, 0, r))
    v = np.array((np.cos(θ), np.sin(θ), 0))
    dt = 1e-2
    qs, vs, λs, μs = trajectory(q, v, dt, num_steps, force, G, dG)
    Qs[:, 3 * i : 3 * (i + 1)] = qs
    Vs[:, 3 * i : 3 * (i + 1)] = vs
    Λs[:, i] = λs.flatten()
100%|██████████| 25/25 [00:21<00:00,  1.16it/s]

As a sanity check, we'll evaluate the change in energy throughout the course of the simulation relative to the mean kinetic energy. The relative differences are on the order of 1%, which suggests that the method is doing a pretty good job. I re-ran this notebook with half the timestep and the energy deviation is cut by a factor of four, indicative of second-order convergence.

fig, ax = plt.subplots()
for i in range(num_trajectories):
    qs, vs = Qs[:, 3 * i : 3 * (i + 1)], Vs[:, 3 * i : 3 * (i + 1)]
    K = 0.5 * np.sum(vs ** 2, axis=1)
    U = np.array([potential(q) for q in qs])
    energies = K + U
    ax.plot((energies - energies[0]) / np.mean(K))

Finally, let's make a movie of the results.

from mpl_toolkits import mplot3d
from mpl_toolkits.mplot3d.art3d import Line3DCollection
from matplotlib.animation import FuncAnimation


def make_animation(
    Qs, depth=25, duration=30.0, start_width=0.1, end_width=1.5, ax=None
):
    num_steps = Qs.shape[0]
    num_particles = Qs.shape[1] // 3

    widths = np.linspace(start_width, end_width, depth)
    collections = []
    for i in range(num_particles):
        q_i = Qs[:depth, 3 * i : 3 * (i + 1)]
        points = q_i.reshape(-1, 1, 3)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        collection = Line3DCollection(segments, linewidths=widths)
        collections.append(collection)
        ax.add_collection(collection)

    def update(step):
        start = max(0, step - depth)
        for i in range(num_particles):
            q_i = Qs[step - depth : step, 3 * i : 3 * (i + 1)]
            points = q_i.reshape(-1, 1, 3)
            segments = np.concatenate([points[:-1], points[1:]], axis=1)
            collections[i].set_segments(segments)

    interval = 1e3 * duration / num_steps
    frames = list(range(depth, num_steps))
    return FuncAnimation(
        ax.figure, update, frames=frames, interval=interval, blit=False
    )

My Riemannian geometry kung fu is weak is but I think that the geodesic flow on this surface is ergodic (see these notes).

%%capture

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.set_xlim((-a, a))
ax.set_ylim((-a, a))
ax.set_zlim((-a, a))
ax.set_axis_off()

animation = make_animation(Qs, depth=100, ax=ax)
from IPython.display import HTML
HTML(animation.to_html5_video())

It's also interesting to have a look at what the respective Lagrange multipliers for position and velocity are doing.

fig, ax = plt.subplots()
ts = np.linspace(0.0, num_steps * dt, num_steps + 1)
ax.plot(ts, Λs[:, 6].reshape(-1));

Note how the Lagrange multipliers aren't smooth -- they have pretty sharp transitions. If you think of the Lagrange multipliers as fictitious "forces" that push the trajectories back onto the constraint manifold, then their amplitude is probably some kind of indicator of the local curvature of the constraint surface.

More interesting now

This all worked well enough for a single particle on the surface. Now let's see what happens if we put several particles on the surface and make them interact. I'd like to find some potential that's repulsive at distances shorter than equilibrium, attractive at longer distances, and falls off to zero at infinity. We could use the Lennard-Jones potential shown in the last demo but the singularity at the origin is going to create more difficulty than necessary. Instead, I'll use a variant of the Ricker wavelet, which is plotted below.

r_e = a / 6
U_e = 0.5
r = sympy.symbols("r")
ρ = r / r_e
potential = U_e / 2 * (1 - 3 * ρ ** 2) * sympy.exp(3 / 2 * (1 - ρ ** 2))
rs = np.linspace(0.0, 3 * r_e, 61)
Us = sympy.lambdify(r, potential, modules="numpy")(rs)

fig, ax = plt.subplots()
ax.set_xlabel("distance / equilibrium")
ax.set_ylabel("potential")
ax.plot(rs / r_e, Us);

I'm using this potential just because it's convenient -- no one thinks there are real particles that act like this.

Now that we're looking at a multi-particle system, we have to evaluate the constraint on every single particle. The derivative matrix has a block structure which a serious implementation would take advantage of.

def G(q):
    return np.array([g_fn(*q[3 * i: 3 * (i + 1)]) for i in range(len(q) // 3)])

# TODO: Make it a sparse matrix
def dG(q):
    n = len(q) // 3
    J = np.zeros((n, 3 * n))
    for i in range(n):
        q_i = q[3 * i: 3 * (i + 1)]
        J[i, 3 * i: 3 * (i + 1)] = dg_fn(*q_i)

    return J

The code below calculates the total forces by summation over all pairs of particles. I added this silly extra variable force_over_r to avoid any annoying singularities at zero distance.

force = sympy.diff(potential, r)
force_over_r = sympy.lambdify(r, sympy.simplify(force / r), modules="numpy")

def F(q):
    n = len(q) // 3
    f = np.zeros_like(q)
    for i in range(n):
        q_i = q[3 * i: 3 * (i + 1)]
        for j in range(i + 1, n):
            q_j = q[3 * j: 3 * (j + 1)]
            r_ij = q_i - q_j
            r = np.sqrt(np.inner(r_ij, r_ij))
            f_ij = force_over_r(r) * r_ij

            f[3 * i: 3 * (i + 1)] += f_ij
            f[3 * j: 3 * (j + 1)] -= f_ij

    return f

To initialize the system, we'll take every 100th point from one of the trajectories that we calculated above.

skip = 100
particle = 3
q = Qs[::skip, 3 * particle : 3 * (particle + 1)].flatten()
v = np.zeros_like(q)
dt = 1e-2
N = 2000
qs, vs, λs, μs = trajectory(q, v, dt, N, F, G, dG, progressbar=True)
100%|██████████| 2000/2000 [04:41<00:00,  7.11it/s]
%%capture

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.set_xlim((-a, a))
ax.set_ylim((-a, a))
ax.set_zlim((-a, a))
ax.set_axis_off()

animation = make_animation(qs, depth=100, ax=ax)
HTML(animation.to_html5_video())

Some of the particles fall into each others' potential wells and become bound, developing oscillatory orbits, while others remain free. For two particles to bind, they have to have just the right momenta and relative positions; if they're moving too fast, they may scatter off of each other, but will ultimately fly off in opposite directions.

Conclusion

Enforcing constraints in solvers for Hamiltonian systems introduces several new difficulties. The basic second-order splitting scheme for unconstrained problems is pretty easy to implement and verify. While the RATTLE algorithm looks to be not much more complicated, it's very easy to introduce subtle off-by-one errors -- for example, accidentally evaluating the constraint derivative at the midpoint instead of the starting position. These mistakes manifest themselves as slightly too large deviations from energy conservation, but these deviations aren't necessarily large in any relative sense. The resulting scheme might still converge to the true solution, in which case the energy deviation will go to zero for any finite time interval. So measuring the reduction in the energy errors asymptotically as $\delta t \to 0$ probably won't catch this type of problem. It may be possible to instead calculate what the next-order term is in the Hamiltonian for the modified system using the Baker-Campbell-Hausdorff formula, but that may be pretty rough in the presence of constraints.

The implementation may be fiddly and annoying, but it is still possible to preserve much of the symplectic structure when constraints are added. The fact that structure-preserving integrators exist at all shouldn't be taken as a given. For example, there don't appear to be any simple structure-preserving adaptive integration schemes; see chapter 9 of Leimkuhler and Reich. The shallow water equations are a Hamiltonian PDE and deriving symplectic schemes that include the necessary upwinding is pretty hard.

There are several respects in which the code I wrote above is sub-optimal. For the multi-particle simulation, the constraints are applied to each particle and consequently the constraint derivative matrix $J$ is very sparse and $J\cdot J^\top$ is diagonal. For expediency's sake, I just used a dense matrix, but this scales very poorly to more particles. A serious implementation would either represent $J$ as a sparse matrix or would go matrix-free by providing routines to calculate the product of $J$ or $J^\top$ with a vector. I also implemented the projection of the momentum back onto the cotangent space by solving the normal equations, which is generally speaking a bad idea. The matrix $J\cdot J^\top$ was diagonal for our problem, so this approach will probably work out fine. For more complex problems, we may be better off solving a least-squares problem with the matrix $J^\top$ using either the QR or singular value decomposition.

Finally, I used a simple interaction potential just so we could see something interesting happen. The potential goes to a finite value at zero separation, which is a little unphysical. A much more serious deficiency was that the potential is defined using the particles' coordinates in 3D Cartesian space. Ideally, we would do everything in a way that relies as little on how the surface is embedded into Euclidean space as possible, which would mean using the geodesic distance instead.

Symplectic integrators

My dad is a physicist of the old school, and what this means is that he has to tell everyone -- regardless of their field -- that what they're doing is so simple as to not even be worth doing and that anyway physicsists could do it better. So whenever my job comes up he has to tell the same story about how he once took a problem to a numerical analyst. This poor bastard ran some code for him on a state-of-the-art computer of the time (a deer skeleton with a KT88 vacuum tube in its asshole) but the solution was total nonsense and didn't even conserve energy. Then pops realizes he could solve the Hamilton-Jacobi equation for the system exactly. Numerical analysis is for clowns.

Naturally, every time we have this conversation, I remind him that we figured out all sorts of things since then, like the fact that people who don't own land should be allowed to vote and also symplectic integrators. In this post I'll talk about the latter. A symplectic integrator is a scheme for solving Hamilton's equations of motion of classical mechanics in such a way that the map from the state at one time to the state at a later time preserves the canonical symplectic form. This is a very special property and not every timestepping scheme is symplectic. For those schemes that are symplectic, the trajectory samples exactly from the flow of a slightly perturbed Hamiltonian, which is a pretty nice result.

The two-body problem

First, we'll illustrate things on the famous two-body problem, which has the Hamiltonian

$$H = \frac{|p_1|^2}{2m_1} + \frac{|p_2|^2}{2m_2} - \frac{Gm_1m_2}{|x_1 - x_2|}$$

where $x_1$, $x_2$ are the positions of the two bodies, $m_1$, $m_2$ their masses, and $G$ the Newton gravitation constant. We can simplify this system by instead working in the coordinate system $Q = (m_1x_1 + m_2x_2) / (m_1 + m_2)$, $q = x_2 - x_1$. The center of mass $Q$ moves with constant speed, reducing the Hamiltonian to

$$H = \frac{|p|^2}{2\mu} - \frac{Gm_1m_2}{|q|}$$

where $\mu = m_1m_2 / (m_1 + m_2)$ is the reduced mass of the system. We could go on to write $q$ in polar coordinates and do several transformations to derive an exact solution; you can find this in the books by Goldstein or Klepper and Kolenkow.

Instead, we'll take the Hamiltonian above as our starting point, but first we want to make the units work out as nicely as possible. The gravitational constant $G$ has to have units of length${}^3\cdot$time${}^{-2}\cdot$mass${}^{-1}$ in order for both terms in the Hamiltonian we wrote above to have units of energy. We'd like for all the lengths and times in the problem to work out to be around 1, which suggests that we measure time in years and length in astronomic units. The depository of all knowledge tells me that, in this unit system, the gravitational constant is

$$G \approx 4\pi^2\, \text{AU}^3 \cdot \text{yr}^{-2}\cdot M_\odot^{-1}.$$

The factor of $M_\odot^{-1}$ in the gravitational constant will cancel with the corresponding factor in the Newton force law. For something like the earth-sun system, where the mass of the sun is much larger than that of the earth, the reduced mass of the system is about equal to the mass of the earth. So if we take the earth mass $M_\oplus$ as our basic mass unit, the whole system works out to about

$$H = \frac{|p|^2}{2} - \frac{4\pi^2}{|q|}.$$

Finally, in this unit system we can take the initial position of the earth to be a $(1, 0)$ AU; we know the angular velocity of the earth is about $2\pi$ AU / year, so the initial momentum is $2\pi$ AU / year. Hamilton's equations of motion are

$$\begin{align} \dot q & = +\frac{\partial H}{\partial p} = p \\ \dot p & = -\frac{\partial H}{\partial q} = -4\pi^2\frac{q}{|q|^3}. \end{align}$$

To start, we'll try out the classic explicit and implicit Euler methods first.

import numpy as np
from numpy import pi as π

q_0 = np.array([1.0, 0.0])
p_0 = np.array([0.0, 2 * π])

final_time = 3.0
num_steps = 3000
dt = final_time / num_steps
def gravitational_force(q):
    return -4 * π ** 2 * q / np.sqrt(np.dot(q, q)) ** 3
def explicit_euler(q, p, dt, num_steps, force):
    qs = np.zeros((num_steps + 1,) + q.shape)
    ps = np.zeros((num_steps + 1,) + p.shape)

    qs[0] = q
    ps[0] = p

    for t in range(num_steps):
        qs[t + 1] = qs[t] + dt * ps[t]
        ps[t + 1] = ps[t] + dt * force(qs[t])
        
    return qs, ps

We'll call out to scipy's nonlinear solver for our implementation of the implicit Euler method. In principle, scipy can solve the resulting nonlinear system of equations solely with the ability to evaluate the forces. But in order to make this approach as competitive as possible we should also provide the derivative of the forces with respect to the positions, which enables using Newton-type methods.

I = np.eye(2)

def gravitational_force_jacobian(q):
    Q = np.sqrt(np.dot(q, q))
    return -4 * π ** 2 / Q ** 3 * (I - 3 * np.outer(q, q) / Q ** 2)
from scipy.optimize import root

def implicit_euler(q, p, dt, num_steps, force, force_jacobian):
    qs = np.zeros((num_steps + 1,) + q.shape)
    ps = np.zeros((num_steps + 1,) + p.shape)

    qs[0] = q
    ps[0] = p

    def f(q, q_t, p_t):
        return q - q_t - dt * (p_t + dt * force(q))
    
    def J(q, q_t, p_t):
        return I - dt ** 2 * force_jacobian(q)
    
    for t in range(num_steps):
        result = root(f, qs[t, :], jac=J, args=(qs[t], ps[t]))
        qs[t + 1] = result.x
        ps[t + 1] = ps[t] + dt * force(qs[t + 1])
        
    return qs, ps
q_ex, p_ex = explicit_euler(
    q_0, p_0, dt, num_steps, gravitational_force
)
q_im, p_im = implicit_euler(
    q_0, p_0, dt, num_steps, gravitational_force, gravitational_force_jacobian
)
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection


def plot_trajectory(q, start_width=1.0, end_width=3.0, **kwargs):
    points = q.reshape(-1, 1, 2)
    segments = np.concatenate([points[:-1], points[1:]], axis=1)
    widths = np.linspace(start_width, end_width, len(points))
    return LineCollection(segments, linewidths=widths, **kwargs)
fig, ax = plt.subplots()
ax.set_aspect("equal")
ax.set_xlim((-1.25, +1.25))
ax.set_ylim((-1.25, +1.25))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
ax.add_collection(plot_trajectory(q_ex, color="tab:blue", label="explicit"))
ax.add_collection(plot_trajectory(q_im, color="tab:orange", label="implicit"))
ax.legend(loc="upper right");

The explicit Euler method spirals out from what looks like to be a circular orbit at first, while the implicit Euler method spirals in. Since the gravitational potential is negative, this means that the explicit Euler scheme is gaining energy, while the implicit Euler scheme is losing energy.

def energies(qs, ps):
    kinetic = 0.5 * np.sum(ps ** 2, axis=1)
    potential = -4 * π ** 2 / np.sqrt(np.sum(qs ** 2, axis=1))
    return kinetic + potential

fig, ax = plt.subplots()
ts = np.linspace(0.0, final_time, num_steps + 1)
ax.plot(ts, energies(q_ex, p_ex), label="explicit")
ax.plot(ts, energies(q_im, p_im), label="implicit")
ax.set_xlabel("time (years)")
ax.set_ylabel("energy")
ax.legend();

If we use a slightly longer timestep, the implicit Euler method will eventually cause the earth and sun to crash into each other in the same short time span of three years. This prediction does not match observations, much as we might wish.

We could reduce the energy drift to whatever degree we desire by using a shorter timestep or using a more accurate method. But before we go and look up the coefficients for the usual fourth-order Runge Kutta method, let's instead try a simple variation on the explicit Euler scheme.

import tqdm
def semi_explicit_euler(q, p, dt, num_steps, force, progressbar=False):
    qs = np.zeros((num_steps + 1,) + q.shape)
    ps = np.zeros((num_steps + 1,) + p.shape)

    qs[0] = q
    ps[0] = p

    iterator = tqdm.trange(num_steps) if progressbar else range(num_steps)
    for t in iterator:
        qs[t + 1] = qs[t] + dt * ps[t]
        ps[t + 1] = ps[t] + dt * force(qs[t + 1])
        
    return qs, ps

Rather than use the previous values of the system state to pick the next system state, we first updated the position, then used this new value to update the momentum; we used force(qs[t + 1]) instead of force(qs[t]). This is an implicit scheme in the strictest sense of the word. The particular structure of the central force problem, however, makes the computations explicit. In fancy terms we would refer to the Hamiltonian as separable. Let's see how this semi-explicit Euler scheme does.

q_se, p_se = semi_explicit_euler(
    q_0, p_0, dt, num_steps, gravitational_force
)
fig, ax = plt.subplots()
ax.set_aspect("equal")
ax.set_xlim((-1.5, +1.5))
ax.set_ylim((-1.5, +1.5))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
ax.add_collection(plot_trajectory(q_ex, color="tab:blue", label="explicit"))
ax.add_collection(plot_trajectory(q_im, color="tab:orange", label="implicit"))
ax.add_collection(plot_trajectory(q_se, color="tab:green", label="symplectic"))
ax.legend(loc="upper right");

The orbit of the semi-explicit or symplectic method shown in green seems to be roughly closed, which is pretty good. The most stunning feature is that the energy drift, while non-zero, is bounded and oscillatory. The amplitude of the drift is smaller than the energy itself by a factor of about one in 10,000.

fig, ax = plt.subplots()
Hs = energies(q_se, p_se)
ax.plot(ts, Hs - Hs[0], label="semi-explicit")
ax.set_xlabel("time (years)")
ax.set_ylabel("energy drift");

Just for kicks, let's try again on an elliptical orbit with some more eccentricity than what we tried here, and on the same circular orbit, for a much longer time window.

final_time = 3e2
num_steps = int(3e4)
dt = final_time / num_steps

q_0 = np.array([1.0, 0.0])
p_0 = np.array([0.0, 2 * π])
q_se, p_se = semi_explicit_euler(q_0, p_0, dt, num_steps, gravitational_force)

ϵ = 0.1
q_0 = np.array([1.0 + ϵ, 0.0])
p_0 = np.array([0.0, 2 * π])
q_el, p_el = semi_explicit_euler(q_0, p_0, dt, num_steps, gravitational_force)
fig, ax = plt.subplots()
ax.set_aspect("equal")
ax.set_xlim((-1.5, +1.5))
ax.set_ylim((-1.5, +1.5))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
ax.add_collection(plot_trajectory(q_se, color="tab:blue", label="circular"))
ax.add_collection(plot_trajectory(q_el, color="tab:orange", label="elliptical"))
ax.legend(loc="lower right");

The orbits don't exactly trace out circles or ellipses -- the orbits precess a bit. Nonetheless, they still remain roughly closed and have bounded energy drift. For less work than the implicit Euler scheme, we got a vastly superior solution. Why is the semi-explicit Euler method so much better than the explicit or implicit Euler method?

Symplectic integrators

Arguably the most important property of Hamiltonian systems is that the energy is conserved, as well as other quantities such as linear and angular momentum. The explicit and implicit Euler methods are convergent, and so their trajectories reproduce those of the Hamiltonian system for any finite time horizon as the number of steps is increased. These guarantees don't tell us anything about how the discretized trajectories behave using a fixed time step and very long horizons, and they don't tell us anything about the energy conservation properties either. The wonderful property about semi-explicit Euler is that the map from the state of the system at one timestep to the next samples directly from the flow of a slightly perturbed Hamiltonian.

Let's try to unpack that statement a bit more. A fancy way of writing Hamilton's equations of motion is that, for any observable function $f$ of the total state $z = (q, p)$ of the system,

$$\frac{\partial f}{\partial t} = \{f, H\}$$

where $\{\cdot, \cdot\}$ denotes the Poisson bracket. For the simple systems described here, the Poisson bracket of two functions $f$ and $g$ is

$$\{f, g\} = \sum_i\left(\frac{\partial f}{\partial q_i}\frac{\partial g}{\partial p_i} - \frac{\partial f}{\partial p_i}\frac{\partial g}{\partial q_i}\right).$$

We recover the usual Hamilton equations of motion by substituting the positions and momenta themselves for $f$. In general, the Poisson bracket can be any bilinear form that's antisymmetric and satisfies the Leibniz and Jacobi identities. In a later demo, I'll look at rotational kinematics, where the configuration space is no longer flat Euclidean space but the Lie group SO(3). The Poisson bracket is rightfully viewed as a 2-form in this setting. Leaving this complications aside for the moment, the evolution equation in terms of brackets is especially nice in that it allows us to easily characterize the conserved quantities: any function $f$ such that $\{f, H\} = 0$. In particular, due to the antisymmetry of the bracket, the Hamiltonian $H$ itself is always conserved.

Solving Hamilton's equations of motion forward in time gives a map $\Phi_t$ from the initial to the final state. The nice part about this solution map is that it obeys the semi-group property: $\Phi_s\circ\Phi_t = \Phi_{s + t}$. In the same way that we can think of a matrix $A$ generating the solution map $e^{tA}$ of the linear ODE $\dot z = Az$, we can also think of the solution map for Hamiltonian systems as being generated by the Poisson bracket with the Hamiltonian:

$$\Phi_t = \exp\left(t\{\cdot, H\}\right)$$

where $\exp$ denotes the exponential map. This isn't a rigorous argument and to really make that clear I'd have to talk about diffeomorphism groups of manifolds. Just believe me and read Jerrold Marsden's books if you don't.

Now comes the interesting part. Suppose we want to solve the linear ODE

$$\dot z = (A + B)z.$$

We'd like to find a way to break down solving this problem into separately solving ODEs defined by $A$ and $B$. It isn't possible to split the problem exactly because, for matrices, $\exp\left(t(A + B)\right)$ is not equal to $\exp(tA)\exp(tB)$ unless $A$ and $B$ commute. But, for small values of $\delta t$, we can express the discrepancy in terms of the commutate $[A, B] = AB - BA$ of the matrices:

$$\exp(\delta t\cdot A)\exp(\delta t\cdot B) = \exp\left(\delta t(A + B) + \frac{\delta t^2}{2}[A, B] + \ldots\right)$$

where the ellipses denote terms of higher order in $\delta t$. Exactly what goes in the higher-order terms is the content of the Baker-Campbell-Hausdorff (BCH) formula. This reasoning is what leads to splitting methods for all kinds of different PDEs. For example, you can show that splitting the solution of an advection-diffusion equation into an explicit step for the advective part and an implicit step for the diffusive part works with an error of order $\mathscr{O}(\delta t)$ using the BCH formula.

The clever part about the analysis of symplectic methods is that we can play a similar trick for Hamiltonian problems (if we're willing to wave our hands a bit). Suppose that a Hamiltonian $H$ can be written as

$$H = H_1 + H_2$$

where exactly solving for the flow of each Hamiltonian $H_1$, $H_2$ is easy. The most obvious splitting is into kinetic and potential energies $K$ and $U$. Integrating the Hamiltonian $K(p)$ is easy because the momenta don't change all -- the particles continue in linear motion according to what their starting momenta were. Integrating the Hamiltonian $U(q)$ is also easy because, while the momenta will change according to the particles' initial positions, those positions also don't change. To write it down explicitly,

$$\Phi^K_t\left(\begin{matrix}q \\ p\end{matrix}\right) = \left(\begin{matrix}q + t\frac{\partial K}{\partial p} \\ p\end{matrix}\right)$$

and

$$\Phi^U_t\left(\begin{matrix}q \\ p\end{matrix}\right) = \left(\begin{matrix}q \\ p - t\frac{\partial U}{\partial q}\end{matrix}\right)$$

Each of these Hamiltonian systems by itself is sort of silly, but the composition of maps $\Phi^U_{\delta t}\circ \Phi^K_{\delta t}$ gives an $\mathscr{O}(\delta t$)-accurate approximation to $\Phi^{K + U}_{\delta t}$ by the BCH formula. Now if we keep up the analogy and pretend like we can apply the BCH formula to Hamiltonian flows exactly, we'd formally write that

$$\exp\left(\delta t\{\cdot, H_1\}\right)\exp\left(\delta t\{\cdot, H_2\}\right) = \exp\left(\delta t\{\cdot, H_1 + H_2\} + \frac{\delta t^2}{2}\left\{\cdot, \{H_1, H_2\}\right\} + \ldots \right).$$

In other words, it's not just that using the splitting scheme above is giving us a $\mathscr{O}(\delta t)$-accurate approximation to the solution $q(t), p(t)$, it's that the approximate solution is sampled exactly from integrating the flow of the perturbed Hamiltonian

$$H' = H + \frac{\delta t}{2}\{H_1, H_2\} + \mathscr{O}(\delta t^2).$$

All of the things that are true of Hamiltonian systems generally are then true of our numerical approximations. For example, they still preserve volume in phase space (Liouville's theorem)); have no stable or unstable equilibrium points, only saddles and centers; and typically have roughly bounded trajectories.

Using the BCH formula to compute the perturbed Hamiltonian helps us design schemes of even higher order. For example, the scheme that we're using throughout in this post is obtained by taking a full step of the momentum solve followed by a full step of the position solve. We could eliminate the first-order term in the expansion by taking a half-step of momentum, a full step of position, followed by a half-step of momentum again:

$$\Psi = \Phi^K_{\delta t / 2}\Phi^U_{\delta t}\Phi^K_{\delta t / 2},$$

i.e. a symmetric splitting. This gives a perturbed Hamiltonian that's accurate to $\delta t^2$ instead:

$$H' = H + \frac{\delta t^2}{24}\left(2\{U, \{U, K\}\} - \{K, \{K, U\}\}\right) + \mathscr{O}(\delta t^4)$$

This scheme is substantially more accurate and also shares a reversibility property with the true problem.

Making all of this analysis really rigorous requires a bit of Lie algebra sorcery that I can't claim to understand at any deep level. But for our purposes it's sufficient to know that symplectic methods like semi-explicit Euler sample exactly from some perturbed Hamiltonian, which is likely to have bounded level surfaces in phase space if the original Hamiltonian did. This fact gives us stability guarantees that are hard to come by any other way.

Molecular dynamics

The two-body gravitational problem is all well and good, but now let's try it for a more interesting and complex example: the motion of atoms. One of the simplest models for interatomic interactions is the Lennard-Jones (LJ) potential, which has the form

$$U = \epsilon\left(\left(\frac{R}{r}\right)^{12} - 2\left(\frac{R}{r}\right)^6\right).$$

The potential is repulsive at distances less than $R$, attractive at distances between $R$ and $2R$, and pretty much zero at distances appreciably greater than $2R$, with a well depth of $\epsilon$. The LJ potential is spherically symmetric, so it's not a good model for polyatomic molecules like water that have a non-trivial dipole moment, but it's thought to be a pretty approximation for noble gases like argon. We'll work in a geometrized unit system where $\epsilon = 1$ and $R = 1$. The code below calculates the potential and forces for a system of several Lennard-Jones particles.

ϵ = 1.0
R = 1.0


def lennard_jones_potential(q):
    U = 0.0
    n = len(q)
    for i in range(n):
        for j in range(i + 1, n):
            z = q[i] - q[j]
            ρ = np.sqrt(np.dot(z, z)) / R
            U += ϵ / ρ ** 6 * (1 / ρ ** 6 - 2)

    return U


def lennard_jones_force(q):
    fs = np.zeros_like(q)
    n = len(q)
    for i in range(n):
        for j in range(i + 1, n):
            z = q[i] - q[j]
            ρ = np.sqrt(np.dot(z, z)) / R
            f = -12 * ϵ / R ** 2 / ρ ** 8 * (1 - 1 / ρ ** 6) * z
            fs[i] += f
            fs[j] -= f

    return fs

This code runs in $\mathscr{O}(n^2)$ for a system of $n$ particles, but the Lennard-Jones interaction is almost completely negligible for distances greater than $3R$. There are approximation schemes that use spatial data structures like quadtrees to index the positions of all the particles and lump the effects of long-range forces. These schemes reduce the overall computational burden to $\mathscr{O}(n\cdot\log n)$ and are a virtual requirement to running large-scale simulations.

For the initial setup, we'll look at a square lattice of atoms separated by a distance $R$. We'll start out with zero initial velocity. If you were to imagine an infinite or periodic lattice of Lennard-Jones atoms, a cubic lattice should be stable. The points immediately to the north, south, east, and west on the grid are exactly at the equilibrium distance, while the forces between an atom and its neighbors to the northwest and southeast should cancel. For this simulation, we won't include any periodicity, so it's an interesting question to see if the cubic lattice structure remains even in the presence of edge effects.

num_rows, num_cols = 10, 10
num_particles = num_rows * num_cols

q = np.zeros((num_particles, 2))
for i in range(num_rows):
    for j in range(num_cols):
        q[num_cols * i + j] = (R * i, R * j)
        
p = np.zeros((num_particles, 2))

I've added a progress bar to the simulation so I can see how fast it runs. Each iteration usually takes about the same time, so after about 10 or so you can tell whether you should plan to wait through the next cup of coffee or until next morning.

dt = 1e-2
num_steps = 2000

qs, ps = semi_explicit_euler(
    q, p, dt, num_steps, force=lennard_jones_force, progressbar=True
)
100%|██████████| 2000/2000 [02:52<00:00, 11.58it/s]

And now for some pretty animations.

%%capture
from matplotlib.animation import FuncAnimation


fig, ax = plt.subplots()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
ax.set_xlim((qs[:, :, 0].min(), qs[:, :, 0].max()))
ax.set_ylim((qs[:, :, 1].min(), qs[:, :, 1].max()))
ax.set_aspect("equal")
points = ax.scatter(qs[0, :, 0], qs[0, :, 1], animated=True)

def update(timestep):
    points.set_offsets(qs[timestep, :, :])

num_steps = len(qs)
fps = 60
animation = FuncAnimation(fig, update, num_steps, interval=1e3 / fps)
from IPython.display import HTML
HTML(animation.to_html5_video())

The cubic lattice is unstable -- the particles eventually rearrange into a hexagonal lattice. We can also see this if we plot the potential energy as a function of time. Around halfway into the simulation, the average potential energy suddenly drops by about $\epsilon / 5$.

ts = np.linspace(0, num_steps * dt, num_steps)
Us = np.array([lennard_jones_potential(q) for q in qs]) / num_particles
fig, ax = plt.subplots()
ax.set_xlabel("time")
ax.set_ylabel("potential energy")
ax.plot(ts, Us);

By way of a posteriori sanity checking, we can see that the total energy wasn't conserved exactly, but the deviations are bounded and the amplitude is much smaller than the characteristic energy scale $\epsilon$ of the problem.

Ks = 0.5 * np.sum(ps ** 2, axis=(1, 2)) / num_particles
Hs = Us + Ks

fig, ax = plt.subplots()
ax.set_xlabel("time")
ax.set_ylabel("energy")
ax.plot(ts, Us, label="potential")
ax.plot(ts, Hs, label="total")
ax.legend();

Conclusion

An introductory class in numerical ODE will show you how to construct convergent discretization schemes. Many real problems, however, have special structure that a general ODE scheme may or may not preserve. Hamiltonian systems are particularly rich in structure -- energy and phase space volume conservation, reversibility. Some very special discretization schemes preserve this structure. In this post, we focused only on the very basic symplectic Euler scheme and hinted at the similar but more accurate Störmer-Verlet scheme. Another simple symplectic method is the implicit midpoint rule

$$\frac{z_{n + 1} - z_n}{\delta t} = f\left(\frac{z_n + z_{n + 1}}{2}\right).$$

There are of course higher-order symplectic schemes, for example Lobatto-type Runge Kutta methods.

We showed a simulation of several particles interacting via the Lennard-Jones potential, which is spherically symmetric. Things get much more complicated when there are rotational degrees of freedom. The rotational degrees of freedom live not in flat Euclidean space but on the Lie group SO(3), and the angular momenta in the Lie algebra $\mathfrak{so}(3)$. More generally, there are specialized methods for problems with constraints, such as being a rotation matrix, or being confined to a surface.

If you want to learn more, my favorite references are Geometric Numerical Integration by Hairer, Lubich, and Wanner and Simulating Hamiltonian Dynamics by Leimkuhler and Reich.

ADMM

In the previous post, I showed how to use Moreau-Yosida regularization for inverse problems with non-smooth regularization functionals. Specifically, we were looking at the total variation functional

$$R(q) = \alpha\int_\Omega|\nabla q|dx$$

as a regularizer, which promotes solutions that are piecewise constant on sets with relatively nice-looking boundary curves. Rather than try to minimize this functional directly, we instead used a smooth approximation, which in many cases is good enough. The smooth approximation is based on penalty-type methods, and one distinct disadvantage of penalty methods is that they tend to wreck the conditioning of the problem. This poor conditioning manifests itself as a multiple order-of-magnitude imbalance in the different terms in the objective. To minimize the objective accurately, say through a line search procedure, you have to do so with an accuracy that matches the magnitude of the smallest term.

In another previous post on Nitsche's method, I looked at how the pure quadratic penalty method compared to the augmented Lagrangian method for imposing Dirichlet boundary conditions. Here we'll proceed in a similar vein: what happens if we go from using a pure penalty method to using an augmented Lagrangian scheme?

Generating the exact data

We'll use the exact same problem as in the previous post on total variation regularization -- a random Fourier series for the boundary data, a quadratic blob for the forcing, and a discontinuous conductivity coefficient.

import firedrake
mesh = firedrake.UnitSquareMesh(32, 32, diagonal='crossed')
Q = firedrake.FunctionSpace(mesh, 'CG', 2)
V = firedrake.FunctionSpace(mesh, 'CG', 2)
import numpy as np
from numpy import random, pi as π
x = firedrake.SpatialCoordinate(mesh)

rng = random.default_rng(seed=1)
def random_fourier_series(std_dev, num_modes, exponent):
    from firedrake import sin, cos
    A = std_dev * rng.standard_normal((num_modes, num_modes))
    B = std_dev * rng.standard_normal((num_modes, num_modes))
    return sum([(A[k, l] * sin(π * (k * x[0] + l * x[1])) +
                 B[k, l] * cos(π * (k * x[0] + l * x[1])))
                / (1 + (k**2 + l**2)**(exponent/2))
                for k in range(num_modes)
                for l in range(int(np.sqrt(num_modes**2 - k**2)))])
from firedrake import interpolate, project
g = interpolate(random_fourier_series(1.0, 6, 1), V)
from firedrake import inner, max_value, conditional, Constant
a = -Constant(4.5)
r = Constant(1/4)
ξ = Constant((0.4, 0.5))
q_true = interpolate(a * conditional(inner(x - ξ, x - ξ) < r**2, 1, 0), Q)
firedrake.trisurf(q_true);
b = Constant(6.)
R = Constant(1/4)
η = Constant((0.7, 0.5))
f = interpolate(b * max_value(0, 1 - inner(x - η, x - η) / R**2), V)
from firedrake import exp, grad, dx, ds
k = Constant(1.)
h = Constant(10.)
u_true = firedrake.Function(V)
v = firedrake.TestFunction(V)
F = (
    (k * exp(q_true) * inner(grad(u_true), grad(v)) - f * v) * dx +
    h * (u_true - g) * v * ds
)
opts = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'lu',
        'pc_factor_mat_solver_type': 'mumps',
    },
}
firedrake.solve(F == 0, u_true, **opts)
firedrake.trisurf(u_true);

Generating the observational data

To create the synthetic observations, we'll once again need to call out directly to PETSc to get a random field with the right error statistics when using a higher-order finite element approximation.

ξ = firedrake.Function(V)
n = len(ξ.dat.data_ro)
ξ.dat.data[:] = rng.standard_normal(n)
from firedrake import assemble
from firedrake.petsc import PETSc
ϕ, ψ = firedrake.TrialFunction(V), firedrake.TestFunction(V)
m = inner(ϕ, ψ) * dx
M = assemble(m, mat_type='aij').M.handle
ksp = PETSc.KSP().create()
ksp.setOperators(M)
ksp.setUp()
pc = ksp.pc
pc.setType(pc.Type.CHOLESKY)
pc.setFactorSolverType(PETSc.Mat.SolverType.PETSC)
pc.setFactorSetUpSolverType()
L = pc.getFactorMatrix()
pc.setUp()
area = assemble(Constant(1) * dx(mesh))
z = firedrake.Function(V)
z.dat.data[:] = rng.standard_normal(n)
with z.dat.vec_ro as Z:
    with ξ.dat.vec as Ξ:
        L.solveBackward(Z, Ξ)
        Ξ *= np.sqrt(area / n)
 = u_true.dat.data_ro[:]
signal = .max() - .min()
signal_to_noise = 50
σ = firedrake.Constant(signal / signal_to_noise)

u_obs = u_true.copy(deepcopy=True)
u_obs += σ * ξ

Solution via ADMM

To motivate ADMM, it helps to understand the augmented Lagrangian method. There are two basic ways to solve equality-constrained optimization problems: the Lagrange multiplier method and the penalty method. The augmented Lagrangian method uses both a Lagrange multiplier and a quadratic penalty, which astonishingly works much better than either the pure Lagrange multiplier or penalty methods. ADMM is based on using the augmented Lagrangian method with a consensus constraint to split out non-smooth problems. Specifically, we want to find a minimizer of the functional

$$J(q) = E(G(q) - u^o) + \alpha\int_\Omega|\nabla q|\, dx$$

where $E$ is the model-data misfit and $G$ is the solution operator for the problem

$$F(u, q) = 0.$$

If $F$ is continuously differentiable with respect to both of its arguments and the linearization with respect to $u$ is invertible, then the implicit function theorem in Banach spaces tells us that such a solution operator $u = G(q)$ exists. Minimizing this functional $J(q)$ is more challenging than the case where we used the $H^1$-norm to regularize the problem because the total variation functional is non-smooth. In the previous post, we showed how to work around this challenge by using Moreau-Yosida regularization. You can motivate Moreau-Yosida regularization by introducing an auxiliary vector field $v$ and imposing the constraint that $v = \nabla q$ by a quadratic penalty method. We can then solve for $v$ exactly because we know analytically what the proximal operator for the 1-norm is. The resulting functional upon eliminating $v$ is the Moreau-Yosida regularized form.

The idea behind ADMM is to instead use an augmented Lagrangian -- combining both the classical method of Lagrange multipliers with the quadratic penalty method -- to enforce the constraint that $v = \nabla q$. This gives us the augmented Lagrangian

$$\begin{align} L_\rho(q, v, \mu) & = E(G(q) - u^o) + \alpha\int_\Omega|v|\, dx \\ & \qquad + \rho\alpha^2\int_\Omega\left(\mu\cdot(\nabla q - v) + \frac{1}{2}|\nabla q - v|^2\right)dx. \end{align}$$

If you've seen ADMM before, you might notice that we've scaled the penalty parameter a bit. We put in an extra factor of $\alpha^2$ with the penalty term $\|\nabla q - v\|^2$ and an extra factor of $\rho\alpha^2$ with the Lagrange multiplier $\mu$ so that it has the same units as both $\nabla q$ and $v$. In order to highlight the connection with Moreau-Yosida regularization, we'll do a slight rearrangement by completing the square:

$$\begin{align} L_\rho(q, v, \mu) & = E(G(q) - u^o) + \alpha\int_\Omega|v|\, dx \\ & \qquad + \frac{\rho\alpha^2}{2}\int_\Omega\left\{|\nabla q + \mu - v|^2 - |\mu|^2\right\}dx. \end{align}$$

If we look at the parts of the Lagrangian involving only $v$, we get something that looks a lot like Moreau-Yosida regularization of the $L^1$ norm, only the argument to be evaluated at is $\nabla q + \mu$. Likewise, if we look at the parts of the Lagrangian involving only $q$, we have something that looks exactly like $H^1$ regularization, only with the regularization centered around $v - \mu$ instead of around 0.

Each iteration of the method will proceed in three steps:

  1. Minimize $L_\rho$ with respect to $q$ only. This step is very similar to using the squared $H^1$-norm for regularization but for the fact that we're not regularizing around 0, but rather around $v - \mu$.
  2. Minimize $L_\rho$ with respect to $v$: $$v \leftarrow \text{soft threshold}_{\rho\alpha}(\nabla q + \mu)$$
  3. Perform a gradient ascent step for $\mu$: $$\mu \leftarrow \mu + \nabla q - v$$

The final gradient ascent step for $\mu$ seemed a little bit mysterious to me at first. Ultimately the whole problem is a saddle point problem for $\mu$, so intuitively it made sense to me that gradient ascent would move you in the right direction. I still felt in my gut that you should have to do some more sophisticated kind of update for $\mu$. So I fell down a deep rabbit hole trying to understanding why the augmented Lagrangian method actually works. The usual references like Nocedal and Wright are deafeningly silent on this issue; the only source I could find that gave a decent explanation was Nonlinear Programming by Bertsekas. You could just believe me that step 3 is the right thing to do, but the reasons why are not at all trivial!

from firedrake_adjoint import (
    Control,
    ReducedFunctional,
    minimize,
)

First, we'll solve the forward problem with our blunt initial guess for the solution. Under the hood, pyadjoint will tape this operation, thus allowing us to correctly calculate the derivative of the objective functional later.

q = firedrake.Function(Q)
u = firedrake.Function(V)
F = (
    (k * exp(q) * inner(grad(u), grad(v)) - f * v) * dx +
    h * (u - g) * v * ds
)
forward_problem = firedrake.NonlinearVariationalProblem(F, u)
forward_solver = firedrake.NonlinearVariationalSolver(forward_problem, **opts)
forward_solver.solve()
firedrake.trisurf(u);

These variables will store the values of the auxiliary field $v$ and the multiplier $\mu$ that enforces the constraint $v = \nabla q$. An interesting question that I haven't seen addressed anywhere in the literature on total variation regularization is what finite element basis to use for $v$ and $\mu$. Here we're using the usual continuous Lagrange basis of degree 1, which seems to work. I've also tried this with discontinuous basis functions and hte estimates for $v$ and $\mu$ seem to have oscillatory garbage. I have a hunch that some bases won't work because they fail to satisfy the LBB conditions. I have another hunch that it would be fun to repeat this experiment with some kind of H(curl)-conforming element for $v$ and $\mu$, but for now we'll just stick with CG(1).

Δ = firedrake.VectorFunctionSpace(mesh, 'CG', 1)
v = firedrake.Function(Δ)
μ = firedrake.Function(Δ)

At first, $v$ and $\mu$ are zero. So when we start the iteration, the first value of $q$ that we'll compute is just what we would have found had we used $H^1$-regularization. We'll start with the ADMM penalty of $\rho = 1$ and we'll use the same regularization penalty of $\alpha = 1 / 20$ that we used in the previous demo.

α = Constant(5e-2)
ρ = Constant(1.0)

Next we'll execute a few steps of the ADMM algorithm. I picked the number of iterations out of a hat. You should use an actual stopping criterion if you care about doing this right.

from firedrake import sqrt
import tqdm

qs = [q.copy(deepcopy=True)]
vs = [v.copy(deepcopy=True)]
μs = [μ.copy(deepcopy=True)]

num_steps = 15
for step in tqdm.trange(num_steps):
    # Step 1: Solve the inverse problem for q.
    J = assemble(
        0.5 * ((u - u_obs) / σ)**2 * dx +
        0.5 * ρ * α**2 * inner(grad(q) + μ - v, grad(q) + μ - v) * dx
    )

    q_opt = minimize(ReducedFunctional(J, Control(q)))
    q.assign(q_opt)
    forward_solver.solve()

    # Step 2: soft thresholding for v.
    z = grad(q) + μ
    expr = conditional(
        (ρ * α) ** 2 * inner(z, z) < 1,
        Constant((0, 0)),
        (1 - 1 / (ρ * α * sqrt(inner(z, z)))) * z
    )
    v.project(expr)
    
    # Step 3: gradient ascent for μ.
    μ.project(μ + grad(q) - v)
    
    qs.append(q.copy(deepcopy=True))
    vs.append(v.copy(deepcopy=True))
100%|██████████| 15/15 [23:25<00:00, 93.67s/it] 

The resulting computed value of $q$ does a great job capturing all of the sharp edges in $q$, despite the fact that we use a penalty parameter of $\rho = 1$. When we used the pure penalty method in the previous demo, we had to take the penalty parameter to be on the order of 400. The inverse solver took much longer to converge and we still missed some of the edges.

import matplotlib.pyplot as plt
fig, axes = plt.subplots()
colors = firedrake.tripcolor(q, axes=axes)
fig.colorbar(colors);

The split variable $v$ matches the gradient of $q$ quite well.

fig, axes = plt.subplots()
colors = firedrake.tripcolor(v, axes=axes)
fig.colorbar(colors);

Finally, let's look at the relative changes in successive iterates of $q$ in the 1-norm at each step in order to get an idea of how fast the method converges.

δs = [
    assemble(abs(q2 - q1) * dx) / assemble(abs(q2) * dx)
    for q1, q2 in zip(qs[:-1], qs[1:])
]
fig, axes = plt.subplots()
axes.set_yscale('log')
axes.set_ylabel('Relative change in $q$')
axes.set_xlabel('Iteration')
axes.plot(δs);

Some iterations seem to hardly advance the solution at all, but when taken in aggregate this looks like the typical convergence of a first-order method.

Discussion

Much like the pure penalty method, the alternating direction method of multipliers offers a way to solve certain classes of non-smooth optimization problem by instead solving a sequence of smooth ones. ADMM, by introducing an explicit Lagrange multiplier estimate to enforce the consensus constraint, offers much faster convergence than the pure penalty method and the size of the penalty parameter does not need to go off to infinity. As a consequence, each of the smooth optimization problems that we have to solve has much better conditioning.

For this test case, we were able to take the penalty parameter $\rho$ to be equal to 1 from the outset and still obtain a good convergence rate. For more involved problems it's likely that we would instead have to test for convergence with a given value of $\rho$ and increase it by some factor greater than 1 if need be. Scaling this penalty parameter by an appropriate power of the regularization parameter $\alpha$ ahead of time makes it dimensionless. This property is especially advantageous for realistic problems but it requires you to know something about the objective you're minimizing.

There are obvious grid imprinting artifacts in the solution that we computed. To remedy this undesirable feature, we could use a mesh adaptation strategy that would refine (preferably anisotropically) along any sharp gradients in $q$.

Finally, we motivated ADMM by assuming that we could take an $L^2$-norm difference of $v$ and $\nabla q$. The idealized, infinite-dimensional version of the problem assumes only that $q$ lives in the space $BV(\Omega)$ of functions of bounded variation. The gradient of such a function is a finite, signed Borel measure, and thus may not live in $L^2$ at all. Hintermüller et al. (2014) gives an alternative formulation based on the dual problem, which has the right coercivity properties for Moreau-Yosida regularization to make sense. It's possible that the form I presented here falls afoul of some subtle functional analysis and that the solutions exhibit strong mesh dependence under refinement. Alternatively, it's possible that, while $v$ and $\nabla q$ only live in the space of finite signed measures and thus are not square integrable, their difference $\nabla q - v$ does live in $L^2$. Investigating this more will have to wait for another day.

Langevin Monte Carlo

In previous posts, we've shown how to solve inverse problems to estimate the coefficients $q$ in a PDE from measured values of the solution $u$. The computational problem that we aimed to solve was to minimize the functional

$$J(u, q) = E(u) + R(q)$$

where $E$ is a misfit functional and $R$ the regularization functional. The solution was subject to a constraint

$$F(u, q) = 0$$

where $F$ is some partial differential equation. Provided that $F$ is nice enough we can calculate derivatives of this functional using the adjoint method and then apply some sort of descent method. Another important fact we'll use below is that the PDE has a solution operator $G(q)$. We can then think of this in terms of a reduced objective functional, which in a gross abuse of notation we'll also write as $J$:

$$J(q) = E(G(q)) + R(q).$$

We can also give this problem a statistical interpretation. The functional

$$\pi(q) = Z^{-1}e^{-J(q)}$$

is the Bayesian posterior density after having taken the measurements given the prior information that $R(q)$ averages to some value. (The scalar $Z$ is a normalizing constant that we effectively cannot calculate.) The maximum a posteriori or MAP estimator is the value of $q$ that maximizes the posterior density. But maximizing the posterior density is the same as minimizing its negative logarithm. This gets us right back to our original problem of minimizing the objective functional $J$.

Computing the most likely value of the parameters given the observations provides valuable information when the measurement errors are normally distributed and when the forward map $G$ is linear or only mildly nonlinear. For many problems, however, the measurement errors are not normal or the forward map is not approximately linear. The MAP estimator might still be useful, but it also might not be all that informative and it might not even exist.

In this post, I'll describe a procedure by which we can instead draw random samples from the posterior distribution. Assuming that our sampling procedure is ergodic, we can approximate expectations of an arbitrary functional $f$ as a weighted sum:

$$\langle f\rangle = \int f(q)\,\mathrm{d}\pi(q) \approx \frac{1}{N}\sum_n f(q_n).$$

It doesn't make sense to seek the most likely value of the parameters when there are many local maxima of the posterior density or when the posterior is so flat near the maximum that it can't be located with reasonable accuracy. In this circumstances, we're better off accepting the uncertainty for what it is and factoring this into what we do with our answer(s).

The classic approach for sampling from the posterior distribution is the Metropolis algorithm, a form of Markov-chain Monte Carlo that I'll assume you're familiar with. One of the challenges in using this algorithm is that it scales poorly to dimensions much more than 10 or 12, and if you're doing PDE problems the dimensions are often in the thousands. Here we'll try here an approach called the Metropolis-adjusted Langevin algorithm or MALA. The key innovation of MALA is to incorporate drift towards higher-probability regions of parameter space into the proposal density. To be more specific, consider the Itô diffusion

$$\dot q = \frac{1}{2}M^{-1}\mathrm{d}\log \pi(q) + M^{-\frac{1}{2}}\dot B,$$

where $M$ is some symmetric and positive-definite linear operator and $B$ is standard Brownian motion. Under a few reasonable hypotheses on $\pi$, the limiting distribution of this process is $\pi$. Of course we can't in general solve this SDE exactly, but we can discretize and I'll describe how below.

Generating the exact data

We'll use the same input data and exact solution as the previous demo -- a random trigonometric polynomial for the boundary data and a spike in the conductivity to make it depart appreciably from the equivalent solution for homogeneous data.

import firedrake
Lx = firedrake.Constant(1.0)
Ly = firedrake.Constant(1.0)
nx, ny = 32, 32
mesh = firedrake.RectangleMesh(nx, ny, float(Lx), float(Ly), diagonal='crossed')
Q = firedrake.FunctionSpace(mesh, 'CG', 2)
V = firedrake.FunctionSpace(mesh, 'CG', 2)
import numpy as np
from numpy import random, pi as π
from firedrake import sin, cos, Constant
x = firedrake.SpatialCoordinate(mesh)

def random_fourier_series(std_dev, num_modes, exponent, seed=1):
    rng = random.default_rng(seed=seed)
    A = std_dev * rng.standard_normal((num_modes, num_modes))
    B = std_dev * rng.standard_normal((num_modes, num_modes))
    
    expr = Constant(0)
    for k in range(num_modes):
        for l in range(int(np.sqrt(num_modes**2 - k**2))):
            ϕ = π * (k * x[0] / Lx + l * x[1] / Ly)
            Z = 1 + (k**2 + l**2)**(exponent / 2)
            a_kl = Constant(A[k, l] / Z)
            b_kl = Constant(B[k, l] / Z)
            expr = expr + a_kl * sin(ϕ) + b_kl * cos(ϕ)
    
    return expr
from firedrake import interpolate
g = interpolate(random_fourier_series(1.0, 6, 1, seed=1), V)
from firedrake import inner, min_value, max_value, Constant
a = -Constant(8.)
r = Constant(1/4)
y = Constant((0.4, 0.5))
q_true = interpolate(a * max_value(0, 1 - inner(x - y, x - y) / r**2), Q)
firedrake.trisurf(q_true);
b = Constant(6.)
R = Constant(1/4)
η = Constant((0.7, 0.5))
f = interpolate(b * max_value(0, 1 - inner(x - η, x - η) / R**2), V)
from firedrake import exp, grad, dx, ds
k = Constant(1.)
h = Constant(10.)
u_true = firedrake.Function(V)
v = firedrake.TestFunction(V)
F = (
    (k * exp(q_true) * inner(grad(u_true), grad(v)) - f * v) * dx +
    h * (u_true - g) * v * ds
)
opts = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'lu',
        'pc_factor_mat_solver_type': 'mumps'
    }
}
firedrake.solve(F == 0, u_true, **opts)
firedrake.trisurf(u_true);

Generating the observational data

In the inverse problem tutorial, we only had to generate a single realization of spatial white noise. Here we'll have to generate many realizations, so we'll wrap this up into a class that will store all the data structures we need. We'll also abstract a bit over what the covariance operator is. In the inverse problem tutorial we used the mass matrix as the covariance operator, and we will do that again here to generate the observational data. As we'll see later, we'll also use this to generate the random noise used in the sampling algorithm, which will have non-trivial spatial correlations. The important thing that the code below does is to apply the inverse of only one of the Cholesky factors instead of both. We had to use the PETSc API to do this, since it isn't natively supported at the Firedrake level and arguably it shouldn't be.

from firedrake import assemble
from firedrake.petsc import PETSc
area = assemble(Constant(1) * dx(mesh))

class NoiseGenerator:
    def __init__(
        self,
        function_space,
        covariance=None,
        generator=random.default_rng()
    ):
        if covariance is None:
            ϕ = firedrake.TrialFunction(function_space)
            ψ = firedrake.TestFunction(function_space)
            covariance = inner(ϕ, ψ) * dx

        M = assemble(covariance, mat_type='aij').M.handle
        ksp = PETSc.KSP().create()
        ksp.setOperators(M)
        ksp.setUp()
        
        pc = ksp.pc
        pc.setType(pc.Type.CHOLESKY)
        pc.setFactorSolverType(PETSc.Mat.SolverType.PETSC)
        pc.setFactorSetUpSolverType()
        L = pc.getFactorMatrix()
        pc.setUp()
        
        self.rng = generator
        self.function_space = function_space
        self.preconditioner = pc
        self.cholesky_factor = L
        
        self.rhs = firedrake.Function(self.function_space)
        self.noise = firedrake.Function(self.function_space)
        
    def __call__(self):
        z, ξ = self.rhs, self.noise
        N = len(z.dat.data_ro[:])
        z.dat.data[:] = self.rng.standard_normal(N)

        L = self.cholesky_factor
        with z.dat.vec_ro as Z:
            with ξ.dat.vec as Ξ:
                L.solveBackward(Z, Ξ)
                Ξ *= np.sqrt(area / N)
                
        return ξ.copy(deepcopy=True)
white_noise_generator = NoiseGenerator(
    function_space=V,
    generator=random.default_rng(seed=1066)
)
 = u_true.dat.data_ro[:]
signal = .max() - .min()
signal_to_noise = 50
σ = firedrake.Constant(signal / signal_to_noise)

u_obs = u_true.copy(deepcopy=True)
ξ = white_noise_generator()
u_obs += σ * ξ
firedrake.trisurf(u_obs);

Sampling

The sampling procedure that we'll use works by approximating paths of an Itô diffusion. Many of the papers you'll come across on MALA or on diffusion processes assume a constant volatility matrix or inverse covariance of the added noise. I've included a factor $M$, which we'll refer to as a preconditioning matrix. The kinds of posterior distributions we encounter in data assimilation properties are often highly anisotropic in parameter space. Making a good choice of preconditioning operator is a virtual necessity if for efficiently sampling from the posterior.

The theoretically optimal choice is to $M$ to be the second derivative or Hessian of the negative log-posterior. Pyadjoint can apply a Hessian action for us, and we can then use the conjugate gradient method with an appropriate preconditioner to solve for the drift. But to calculate the diffusion we need an inverse square root of the Hessian operator, and that will require us to call in to SLEPc.

The worst thing we could do is nothing because of the mesh-dependence problems that I described in the previous post on inverse problems. The finite element mass matrix is the absolute least you can do without committing a criminal offense -- it represents the map from the dual of $L^2$ back to $L^2$ itself.

To do better than just the mass matrix, we can try to come up with a linear operator that will roughly recover the spectral asymptotics of the Hessian but which we can more easily express as a sparse matrix -- the Hessian will be dense. Then we can just reuse the noise sampler class that I wrote above to compute the Cholesky factorization and generate random variables with this operator as its inverse covariance matrix. Here we'll use an operator $M$ with the weak form

$$\langle M\phi, \psi\rangle = \int_\Omega\left(\phi\cdot\psi + \ell^2\nabla\phi\cdot\nabla\psi\right)dx$$

where $\ell$ is some length scale, which I'll take to be the diameter of the domain. If the parameters $q$ live in the Sobolev space $H^1(\Omega)$, then $M$ is the canonical map from $H^1(\Omega)$ to its dual. The length scale $\ell$ is there to make the units work out right.

ϕ, ψ = firedrake.TestFunction(Q), firedrake.TrialFunction(Q)
 = firedrake.sqrt(Lx * Ly)
M = (ϕ * ψ + **2 * inner(grad(ϕ), grad(ψ))) * dx

diffusion_generator = NoiseGenerator(
    function_space=Q,
    covariance=M,
    generator=random.default_rng(1453)
)

We'll again start with the very blunt initial guess that the log-conductivity is 0 to initialize the problem.

q = firedrake.Function(Q)
u = firedrake.Function(V)
F = (
    (k * exp(q) * inner(grad(u), grad(v)) - f * v) * dx +
    h * (u - g) * v * ds
)
firedrake.solve(F == 0, u, **opts)

As I alluded to earlier, our proposal distribution for the MCMC algorithm is based on integrating the SDE

$$\dot q = -\frac{1}{2}M^{-1}\mathrm{d}J(q) + M^{-\frac{1}{2}}\dot B$$

where $B$ is standard Brownian motion. The Euler-Maruyama scheme to integrate this equation for a single step of length $\delta t$ is

$$q^* = q - \frac{\delta t}{2}M^{-1}\mathrm{d}J(q) + \delta t^{\frac{1}{2}}M^{-\frac{1}{2}}\delta B.$$

But a naive integration of the SDE may have a different limiting distribution from the SDE itself, and it may even diverge. To fix this problem, we'll use a Metropolis-style accept/reject step. Having generated a proposal $q^*$, we'll accept it with probability

$$\alpha = \min\left\{1, \frac{\pi(q^*)\cdot\rho(q^*\to q)}{\pi(q)\cdot\rho(q\to q^*)}\right\}$$

where $\pi$ is the posterior density and $\rho$ is the transition density. We don't know the normalizing factor in the posterior density, but by taking the ratio of the two this factor cancels:

$$\ln\frac{\pi(q^*)}{\pi(q)} = J(q) - J(q^*).$$

In the classic random walk Metropolis-Hastings algorithm, the ratio of the transition density from the current state to the proposed state usually cancels because the transition density is symmetric -- the chance of going from $q$ to $q^*$ is equal to the chance of going from $q^*$ back to $q$. The algorithm that we're using here lacks this symmetry because of the gradient flow term in the proposal density. The transition density can be written as:

$$\rho(q \to q^*) = Z^{-1}\exp\left\{-\frac{1}{2\tau}\left\|q^* - q + \frac{1}{2}\tau M^{-1}dJ(q)\right\|_M^2\right\}$$

Once again, the normalizing factor $Z$ is not directly computable but we're only interested in ratios:

$$2\tau\ln\frac{\rho(q^* \to q)}{\rho(q \to q^*)} = \left\|q^* - q + \frac{1}{2}\tau M^{-1}dJ(q)\right\|_M^2 - \left\|q - q^* + \frac{1}{2}\tau M^{-1}dJ(q^*)\right\|_M^2$$

This means that we need variables to store both the parameter guess, the PDE solution, and the derivative of the log-posterior at both the current state and the proposal at every stage of the iteration.

from firedrake_adjoint import Control, ReducedFunctional

α = Constant(1e-1)
drift_solver = firedrake.LinearSolver(assemble(M), **opts)
z = firedrake.Function(Q)
z_n = firedrake.Function(Q)
q_n = firedrake.Function(Q)
u_n = firedrake.Function(V)

J = 0.5 * (
    ((u - u_obs) / σ)**2 +
    α**2 * inner(grad(q), grad(q))
) * dx

 = Control(q)
 = ReducedFunctional(assemble(J), )
dĴ_dq = .derivative()
drift_solver.solve(z, dĴ_dq)

I've tuned the step length for the Euler-Maruyama integration to get a good accept/reject ratio and (as we'll see later) to get a good decorrelation time.

τ = firedrake.Constant(5 / 6)

At the end of the loop I've also added a call to clear the tape for adjoint annotation. If you don't clear the tape, the simulation can eat up more and more memory as it goes on.

from firedrake import sqrt, interpolate, replace, energy_norm
import firedrake_adjoint
import tqdm

us = []
qs = []
Js = []

rng = random.default_rng(seed=42)

num_accepted = 0
num_rejected = 0

num_samples = 3000
progress = tqdm.trange(num_samples)
for sample in progress:
    δB = diffusion_generator()
    q_n = interpolate(q - 0.5 * τ * z + sqrt(τ) * δB, Q)
    F_n = replace(F, {q: q_n, u: u_n})
    firedrake.solve(F_n == 0, u_n, **opts)
    
    J_n = replace(J, {q: q_n, u: u_n})
     = Control(q_n)
     = ReducedFunctional(assemble(J_n), )
    dĴ_dq = .derivative()
    drift_solver.solve(z_n, dĴ_dq)

    δln_π = float(assemble(J - J_n))
    δq = interpolate(q_n - q + 0.5 * τ * z, Q)
    δq_n = interpolate(q - q_n + 0.5 * τ * z_n, Q)
    δln_ρ = float(
        assemble(energy_norm(M, δq) - energy_norm(M, δq_n)) / (2 * τ)
    )
    alpha = np.exp(δln_ρ + δln_π)
    roll = rng.uniform(0, 1)
    accept = roll > 1 - alpha
    num_accepted += accept
    num_rejected += not accept
    progress.set_description(
        f'Accepted, rejected: {num_accepted}, {num_rejected} |'
    )
    
    if accept:
        q.assign(q_n)
        u.assign(u_n)
        z.assign(z_n)

    qs.append(q.copy(deepcopy=True))
    us.append(u.copy(deepcopy=True))
    Js.append(float(assemble(J_n)))
    
    firedrake_adjoint.get_working_tape().clear_tape()
Accepted, rejected: 1795, 1205 |: 100%|██████████| 3000/3000 [22:37<00:00,  2.21it/s]

This run took only a few minutes, but I've overheard statistics grad students talk about Monte Carlo simulations runing overnight or even for several days. Geyer's introduction to the Handbook of Markov Chain Monte Carlo -- which I highly recommend reading, it's available for free from the book website -- implies that overnight is a bare minimum to show due dilligence.

Another nice feature of this algorithm is that it's easy to run several chains in parallel. We haven't done this here, but comparing results across many chains is common to several approaches for a posteriori sanity checking.

Now that we have our samples, it's time to actually analyze the results and do something with them.

Burn-in

The probability distribution of the samples eventually converges to the posterior density, but in the initial stages of the chain, the distribution can be very far off from the limiting value. We can see that by plotting the values of the negative log posterior (or the objective functional if you're still in the optimization frame of mind).

import matplotlib.pyplot as plt
fig, axes = plt.subplots()
axes.plot(Js);

The objective functional decreases almost monotonically for the first 200 steps or so; the chain is effectively doing deterministic gradient descent towards a more probable region of parameter space. This phenomenon is referred to as the burn-in. It's common to discard all of the samples from the burn-in phase. From here on out we'll forget these iterations entirely.

num_burn_steps = 200
qs = qs[num_burn_steps:]

We could have avoided this ad-hoc, manual step by taking a few iterations of an optimization procedure to approximate the maximizer of the posterior distribution and starting the Markov chain from there.

The trace plot we showed above helps us diagnose where burn-in occurs, but doesn't give a very good idea of the distribution of $J$ values. For that we can make a histogram.

fig, axes = plt.subplots()
axes.hist(Js, 30)
axes.set_xlabel('$J$ value')
axes.set_ylabel('count');

The histogram shows some signs of bimodality, which is definitely worth investigating. There might be a local maximum of the probability distribution separate from the true mode.

Having tossed the initial deterministic junk, we can take a guess at what the posterior mean of the distribution is.

q_avg = firedrake.Function(Q)
q_avg.dat.data[:] = np.mean(
    np.array([q.dat.data_ro[:] for q in qs]), axis=0
)

Note how similar this looks to the maximum a posteriori estimate that we obtained in the previous demo.

firedrake.trisurf(q_avg);

Sample size

After we've discarded the samples from the burn-in phase, the successive states of the Markov chain are still correlated and this begs the question of how many independent samples we actually obtained. Assessing the effective sample size turns out to be surprisingly subtle. The idealized, mathematical definition of the effective sample size, assuming that we could calculate the correlations exactly, is

$$N_{\text{eff}} = \frac{N}{1 + 2\sum_{k = 1}^\infty r_k}$$

where $r_k$ is the autocorrelation at lag $k$. Let's suppose for now that we're only interested in calculating the effective number of samples of the negative log-posterior $J$. This is a scalar quantity so it's much cheaper to compute on, which is good because our first few attempts are going to go down in flames. The autocorrelation is defined as

$$r_k = \langle (J(q_{i + k}) - \mu)(J(q_i) - \mu)\rangle / \sigma^2$$

where $\mu$ and $\sigma$ are the mean and standard deviation of $J$. We could try to approximate the autocorrelation from the samples themselves.

Js = Js[num_burn_steps:]
J_avg = np.mean(Js)
J_std = np.sqrt(np.mean((Js - J_avg)**2))
print(f'Mean, standard deviation of J: {J_avg:5.3f}, {J_std:5.3f}')

autocorr = np.zeros_like(Js)
autocorr[0] = 1.0
for k in range(1, len(Js)):
    autocorr[k] = np.mean(
        (Js[:-k] - J_avg) * (Js[k:] - J_avg)
    ) / J_std**2
Mean, standard deviation of J: 2.959, 0.797

The empirical autocorrelation goes totally bonkers at large lags because there aren't enough samples spaced that far apart to adequately estimate the value of the autocorrelation function.

fig, axes = plt.subplots()
axes.plot(autocorr)
axes.set_ylabel('autocorrelation')
axes.set_xlabel('lag');

Just to really drive home how nonsensical this is, let's calculate the denominator in the expression for the effective sample size.

1 + 2 * np.sum(autocorr[1:])
-37.79065108326468

wat.

We could try an alternative formula instead that weights every autocorrelation value by $1 / N$. This weighting is guaranteed to produce a positive-definite autocorrelation function (see this SO answer), unlike the form I used above. Feels a little ad-hoc to me but ok?

autocorr = np.zeros_like(Js)
autocorr[0] = 1.0
N = len(Js)
for k in range(1, N):
    autocorr[k] = np.sum(
        (Js[:-k] - J_avg) * (Js[k:] - J_avg)
    ) / J_std**2 / N

Slightly less horrific-looking.

fig, axes = plt.subplots()
axes.plot(autocorr)
axes.set_xlabel('lag')
axes.set_ylabel('autocorrelation');

Now let's see what the correlation time is.

1 + 2 * np.sum(autocorr[1:])
-6.661338147750939e-16

This would imply that we have a few quadrillion independent samples which, again, I find to be highly suspect.

Hopefully by this point you're convinced that estimating the effective sample size from the samples themselves is very not trivial, so now I'll refer you to a great paper called A Comparison of Methods for Computing Autocorrelation Time by Madeleine Thompson. One of the methods she tested is to use batch means and this is what we'll use below. The idea of batch means is to divide the $n$ samples into $m$ batches, and to compare the overall sample variance with the variance of the batch means. For the method to converge, the number of batches and the size of each batch has to go to infinity. A common choice is to take $m = \sqrt{n}$, or some other power.

batch_size = int(np.sqrt(N / 2))
num_batches = int(N / batch_size)
batch_means = np.zeros(num_batches)
for index in range(num_batches):
    batch_means[index] = np.mean(
        Js[index * batch_size: (index + 1) * batch_size]
    )
    
batch_std = np.sqrt(np.mean((batch_means - J_avg)**2))
correlation_time = batch_size * batch_std / J_std

print(f"""
    batch size:     {batch_size}
    total std dev:  {J_std:5.3f}
    batch std dev:  {batch_std:5.3f}
    corr time:      {correlation_time:5.3f}
""")
    batch size:     37
    total std dev:  0.797
    batch std dev:  0.324
    corr time:      15.061

This suggests that about one out of every 12 Monte Carlo samples is effectively independent, which is pretty good. We might be able to improve this further by tuning the stepsize or by using a better choice of preconditioner $M$.

There are other ways to estimate effective sample size and the Thompson paper I linked above does a nice comparison of them on several example problems. Nonetheless, I'm surprised to have found only one paper that compared them. It's been cited about 50 times but never published in a peer-reviewed journal. I'm also surprised to have come across so many papers that don't report how they computed the effective sample size or even what it was for their simulation.

A few other subtle points are worth mentioning here too. The effective sample size can differ depending on exactly what quantity you're measuring. For example, let's see what the effective sample size is for the cosine of the angle between the estimated parameter and the true value.

from firedrake import norm
def angle(p, q):
    cos_θ = float(assemble(p * q * dx) / (norm(p) * norm(q)))
    return np.arccos(cos_θ)

angles = np.array([angle(q, q_true) for q in qs])
angles_avg = np.mean(angles)
angles_std = np.sqrt(np.mean((angles - angles_avg)**2))
print(f'Mean, stddev of ∠(q, q_true): {angles_avg:5.3f}, {angles_std:5.3f}')
Mean, stddev of ∠(q, q_true): 0.489, 0.044
batch_means = np.zeros(num_batches)
for index in range(num_batches):
    batch_means[index] = np.mean(
        angles[index * batch_size: (index + 1) * batch_size]
    )
    
batch_std = np.sqrt(np.mean((batch_means - angles_avg)**2))
correlation_time = batch_size * batch_std / angles_std

print(f"""
    batch size:     {batch_size}
    total std dev:  {angles_std:5.3f}
    batch std dev:  {batch_std:5.3f}
    corr time:      {correlation_time:5.3f}
""")
    batch size:     37
    total std dev:  0.044
    batch std dev:  0.015
    corr time:      12.347

We have just a bit more effective samples of the angle between $q$ and the true value as we do of $J(q)$.

A second and perhaps more surprising fact is that, if the autocorrelation function really is negative, you can have more effective samples than total samples. It's possible to specially design the proposal distribution to make the autocorrelation function at odd lags negative; this is the idea behind antithetic updates. There's a nice discussion of the unintuitive behaviors of effective sample size here.

Reflection

I've worked on inverse problems for several years from a purely optimization point of view. I knew about Monte Carlo approaches, but I'd written them off for a while because I had the impression that they scale very badly to high dimensions. So I was very happy to hear about approaches like MALA or Hamiltonian Monte Carlo that overcome the dimensionality curse. I'd spent a lot of time beating my head against the wall trying (and failing) to implement more robust optimization algorithms and preconditioners to solve an inverse problem that doesn't even have a unique solution. So I find it more appealing on a philosophical level to confront the possible non-uniqueness head-on by sampling directly from the posterior distribution. This isn't to say that MCMC sampling is a cakewalk. Here I touched on how estimating the effective sample size is a poorly-documented but necessary part of the workflow. Things get worse when you consider the likelihood of sampling distinct regions of parameter space that are separated by a large potential barrier. Even with these faults, I find the sampling approach to be a big improvement both in implementation feasibility and in what you can learn from the results over the optimization approach.

Rosenbrock schemes

In the previous demo, we looked at a few spatial and temporal discretizations of the nonlinear shallow water equations. One of the challenging parts about solving systems of hyperbolic PDE like the shallow water equations is choosing a timestep that satisfies the Courant-Friedrichs-Lewy condition. You can pick a good timestep ahead of time for a linear autonomous system. A nonlinear system, on the other hand, might wander into weird parts of phase space where the characteristic wave speeds are much higher. You might be able to pick a good timestep from the outset, but it's likely to be overly conservative and waste loads of compute time. The tyranny of the CFL condition is the reason why it's such a common grumble among practitioners that ocean models explode if you look at them sideways.

All of the timestepping schemes we used in the previous demo were Runge-Kutta methods, which look something like this:

$$z_{n + 1} = z_n + \delta t\cdot \sum_ib_ik_i$$

where $b_i$ are weights and the stages $k_i$ are defined as

$$k_i = f\left(z_n + \delta t\sum_ja_{ij}k_j\right).$$

For the method to be explicit, we would need that $a_{ij} = 0$ if $j \ge i$. You can find all the conditions for a Runge-Kutta method to have a certain order of accuracy in time in books like Hairer and Wanner.

Implicit Runge-Kutta schemes can eliminate many of the frustrating stability issues that occur with explicit schemes. Implicit methods can use timesteps that far exceed the CFL-stable timestep. But they introduce the added complexity of having to solve a nonlinear system at every timestep. What globalization strategy will you use for Newton's method? What preconditioner will you use in solving the associated linear systems? These are all decisions you didn't have to make before. It's possible to reduce some of the pain and suffering by using schemes for which $a_{ii}$ can be nonzero but $a_{ij} = 0$ if $j > 0$ -- these are the diagonally-implicit Runge-Kutta schemes. Rather than have to solve a gigantic nonlinear system for all of the stages $k_i$ at once, you only have to solve a sequence of nonlinear systems for each stage.

The idea behind Rosenbrock methods is to perform only a single iteration of Newton's method for the nonlinear system defining the Runge-Kutta stages, rather than actually solve that system to convergence. There are two heuristic justifications for Rosenbrock schemes. First, a scheme like implicit Euler is only first-order accurate in time anyway, so there isn't much reason to do a really accurate nonlinear system solve as part of a fairly crude timestepping scheme. Second, for a timestep that isn't too much larger than the characteristic timescale of the problem, the current system state is probably either in the quadratic convergence basin for Newton's method or at least fairly close.

More general Rosenbrock schemes follow from this idea. The best reference I've found is one of the original papers on the subject, Kaps and Rentrop (1979). This paper shows more general schemes in this family, derives the order conditions for the various weights and parameters, and perhaps most importantly derives an embedded Rosenbrock scheme that can be used for adaptive timestep control. Here we'll show one of the most basic schemes, which comes from taking a single Newton step for the implicit midpoint rule.

Setup

All of this is copied from the previous demo, so I'll give only cursory explanations.

import firedrake
from firedrake import Constant
g = Constant(9.81)
I = firedrake.Identity(2)

The following functions compute symbolic representations of the various shallow water fluxes.

from firedrake import inner, grad, dx
def cell_flux(z):
    Z = z.function_space()
    h, q = firedrake.split(z)
    ϕ, v = firedrake.TestFunctions(Z)
    
    f_h = -inner(q, grad(ϕ)) * dx

    F = outer(q, q) / h + 0.5 * g * h**2 * I
    f_q = -inner(F, grad(v)) * dx

    return f_h + f_q
from firedrake import avg, outer, dS
def central_facet_flux(z):
    Z = z.function_space()
    h, q = firedrake.split(z)
    ϕ, v = firedrake.TestFunctions(Z)
    
    mesh = z.ufl_domain()
    n = firedrake.FacetNormal(mesh)

    f_h = inner(avg(q), ϕ('+') * n('+') + ϕ('-') * n('-')) * dS

    F = outer(q, q) / h + 0.5 * g * h**2 * I
    f_q = inner(avg(F), outer(v('+'), n('+')) + outer(v('-'), n('-'))) * dS
    
    return f_h + f_q
from firedrake import sqrt, max_value
def lax_friedrichs_facet_flux(z):
    Z = z.function_space()
    h, q = firedrake.split(z)
    ϕ, v = firedrake.TestFunctions(Z)
    
    mesh = h.ufl_domain()
    n = firedrake.FacetNormal(mesh)
    
    c = abs(inner(q / h, n)) + sqrt(g * h)
    α = avg(c)
    
    f_h = -α * (h('+') - h('-')) * (ϕ('+') - ϕ('-')) * dS
    f_q = -α * inner(q('+') - q('-'), v('+') - v('-')) * dS

    return f_h + f_q
def topographic_forcing(z, b):
    Z = z.function_space()
    h = firedrake.split(z)[0]
    v = firedrake.TestFunctions(Z)[1]

    return -g * h * inner(grad(b), v) * dx

For an explicit time discretization and a DG method in space, we can use an ILU preconditioner with a block Jacobi inner preconditioner and this will exactly invert the DG mass matrix.

block_parameters = {
    'ksp_type': 'preonly',
    'pc_type': 'ilu',
    'sub_pc_type': 'bjacobi'
}

parameters = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'fieldsplit',
        'fieldsplit_0': block_parameters,
        'fieldsplit_1': block_parameters
    }
}
from firedrake import (
    NonlinearVariationalProblem as Problem,
    NonlinearVariationalSolver as Solver
)

class SSPRK3:
    def __init__(self, state, equation):
        z = state.copy(deepcopy=True)
        dt = firedrake.Constant(1.0)
        
        num_stages = 3
        zs = [state.copy(deepcopy=True) for stage in range(num_stages)]
        Fs = [equation(z), equation(zs[0]), equation(zs[1])]
        
        Z = z.function_space()
        w = firedrake.TestFunction(Z)
        forms = [
            inner(zs[0] - z, w) * dx - dt * Fs[0],
            inner(zs[1] - (3 * z + zs[0]) / 4, w) * dx - dt / 4 * Fs[1],
            inner(zs[2] - (z + 2 * zs[1]) / 3, w) * dx - 2 * dt / 3 * Fs[2]
        ]
        
        problems = [Problem(form, zk) for form, zk in zip(forms, zs)]
        solvers = [Solver(problem, **parameters) for problem in problems]
        
        self.state = z
        self.stages = zs
        self.timestep = dt
        self.solvers = solvers
    
    def step(self, timestep):
        self.timestep.assign(timestep)
        for solver in self.solvers:
            solver.solve()
        self.state.assign(self.stages[-1])

We'll create some auxiliary functions to actually run the simulation and create an animation of it.

import tqdm

def run_simulation(solver, final_time, num_steps, output_freq):
    hs, qs = [], []
    qs = []
    pbar = tqdm.trange(num_steps)
    for step in pbar:
        if step % output_freq == 0:
            h, q = solver.state.split()
            hmin, hmax = h.dat.data_ro.min(), h.dat.data_ro.max()
            pbar.set_description(f'{hmin:5.3f}, {hmax:5.3f}')
            hs.append(h.copy(deepcopy=True))
            qs.append(q.copy(deepcopy=True))

        solver.step(timestep)
    
    return hs, qs
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

def make_animation(hs, b, timestep, output_freq, **kwargs):
    fig, axes = plt.subplots()
    axes.set_aspect('equal')
    axes.set_xlim((0.0, Lx))
    axes.set_ylim((0.0, Ly))
    axes.get_xaxis().set_visible(False)
    axes.get_yaxis().set_visible(False)
    η = firedrake.project(hs[0] + b, hs[0].function_space())
    colors = firedrake.tripcolor(
        hs[0], num_sample_points=1, axes=axes, **kwargs
    )
    
    def animate(h):
        η.project(h + b)
        colors.set_array(η.dat.data_ro[:])

    interval = 1e3 * output_freq * timestep
    animation = FuncAnimation(fig, animate, frames=hs, interval=interval)
    
    plt.close(fig)
    return HTML(animation.to_html5_video())

Rosenbrock scheme

The implementation of the Rosenbrock scheme is fairly similar to the other timestepping methods we've shown before, but we have an extra term in the variational problem describing the linearization of the dynamics. We're also making the initializer take some extra arguments for solver parameters. When we were using explicit schemes, there was really only one sane choice of solver parameters because the matrix we had to invert was just a DG mass matrix. Here, the choice of iterative solvers and preconditioners can become much more involved, as we'll show later.

from firedrake import derivative
class Rosenbrock:
    def __init__(self, state, equation, solver_parameters=None):
        z = state.copy(deepcopy=True)
        F = equation(z)
        
        z_n = z.copy(deepcopy=True)
        Z = z.function_space()
        w = firedrake.TestFunction(Z)

        dt = firedrake.Constant(1.0)

        dF = derivative(F, z, z_n - z)
        problem = Problem(
            inner(z_n - z, w) * dx - dt / 2 * dF - dt * F,
            z_n
        )
        solver = Solver(problem, solver_parameters=solver_parameters)

        self.state = z
        self.next_state = z_n
        self.timestep = dt
        self.solver = solver
        
    def step(self, timestep):
        self.timestep.assign(timestep)
        self.solver.solve()
        self.state.assign(self.next_state)

Demonstration

We'll use the same input data and function spaces as before -- BDFM(2) for the momentum and DG(1) for the thickness.

nx, ny = 32, 32
Lx, Ly = 20., 20.
mesh = firedrake.PeriodicRectangleMesh(
    nx, ny, Lx, Ly, diagonal='crossed'
)

x = firedrake.SpatialCoordinate(mesh)
lx = 5.
y = Constant((lx, lx))
r = Constant(2.5)

h_0 = Constant(1.)
δh = Constant(1/16)
h_expr = h_0 + δh * max_value(0, 1 - inner(x - y, x - y) / r**2)

y = Constant((3 * lx, 3 * lx))
δb = Constant(1/4)
b = δb * max_value(0, 1 - inner(x - y, x - y) / r**2)
DG1 = firedrake.FunctionSpace(mesh, family='DG', degree=1)
BDFM2 = firedrake.FunctionSpace(mesh, family='BDFM', degree=2)
Z = DG1 * BDFM2
z0 = firedrake.Function(Z)
z0.sub(0).project(h_expr - b);
def F(z):
    return (
        cell_flux(z) +
        central_facet_flux(z) +
        lax_friedrichs_facet_flux(z) -
        topographic_forcing(z, b)
    )
SSPRK(3)

To get a baseline solution, we'll use the SSPRK(3) scheme from before.

solver = SSPRK3(z0, F)
final_time = 10.0
timestep = 5e-3
num_steps = int(final_time / timestep)
output_freq = 10
hs, qs = run_simulation(solver, final_time, num_steps, output_freq)
0.786, 1.028: 100%|██████████| 2000/2000 [14:26<00:00,  2.31it/s]
make_animation(
    hs, b, timestep, output_freq, shading='gouraud', vmin=0.96, vmax=1.04
)
energies_ssprk3 = [
    firedrake.assemble(
        0.5 * (inner(q, q) / h + g * (h + b)**2) * dx
    )
    for h, q in zip(hs, qs)
]

fig, axes = plt.subplots()
axes.plot(energies_ssprk3);

So that we have a number to compare against for later, we can calculate the total energy drift from the beginning to the end of the simulation:

energies_ssprk3[-1] - energies_ssprk3[0]
0.48253018551554305
Rosenbrock

Now let's see how the new scheme scheme fares.

solver = Rosenbrock(z0, F)

We can use a much longer timestep than we could with explicit methods.

final_time = 10.0
timestep = 50e-3
num_steps = int(final_time / timestep)
output_freq = 1
hs, qs = run_simulation(solver, final_time, num_steps, output_freq)
0.784, 1.035: 100%|██████████| 200/200 [03:56<00:00,  1.18s/it]

A subtle but interesting feature you can see in this animation is that the spurious wave emanating from the bump at the bed has a much smaller magnitude with the Rosenbrock scheme than with any of the explicit schemes.

make_animation(
    hs, b, timestep, output_freq, shading='gouraud', vmin=0.96, vmax=1.04
)

The energy drift is cut by a factor of 5 compared to using an explicit scheme. On top of that, we were able to achieve it using much larger timesteps than were CFL-stable before, and as a consequence the overall time for the simulation is shorter.

energies_rosenbrock = [
    firedrake.assemble(
        0.5 * (inner(q, q) / h + g * (h + b)**2) * dx
    )
    for h, q in zip(hs, qs)
]

fig, axes = plt.subplots()
axes.plot(energies_rosenbrock);
energies_rosenbrock[-1] - energies_rosenbrock[0]
0.0936657250274493

Conclusion

In the previous post, we showed some of the difficulties associated with solving the shallow water equations. The two biggest problems we ran into were getting a CFL-stable timestep and controlling energy drift. Rosenbrock schemes almost eliminate stability problems and decimate the drift as well. While they are substantially more expensive for a single timestep, there are a lot of gains to be had by using a better preconditioner. On top of that, we can gain other efficiencies by approximating the linearized dynamics with a matrix that's easier to invert.

Inverse problems

In previous posts, we've seen how to solve elliptic PDE, sometimes with constraints, assuming we know everything about the coefficients and other input data. Some problems in geophysics and engineering involve going backwards. We have direct measurements of some field that we know is the solution of a PDE, and from that data we want to estimate what the coefficients were. This is what's called an inverse problem. For example, knowing the inflow rate of groundwater and the degree to which the soil and bedrock are porous, we can calculate what the hydraulic head will be by solving the Poisson equation; this is the forward problem. The inverse problem would be to estimate the porosity from measurements of the hydraulic head.

We've already seen many of the techniques that we'll use to solve inverse problems and in this post I'll demonstrate them. Inverse problems can be expressed through PDE-constrained optimization, and the biggest challenge is calculating the gradient of the objective functional with respect to the input parameters. There's a systematic and practical algorithm to do this called the adjoint method. The UFL language for variational forms preserves enough of the high-level semantics of what problem you're solving, and consequently it's possible to generate all of the code necessary to implement the adjoint method solely from the code for the weak form. The package pyadjoint does this and even won a Wilkinson Prize for numerical software. In the following, I'll use pyadjoint to both calculate derivatives and solve optimization problems, but it's instructive to roll your own adjoint method and solvers if you haven't done it before.

The problem

Suppose that the physics we're interested in can be described by the Poisson problem. We want to estimate is the conductivity coefficient and we have measurements of the solution $u$. Rather than solve for the conductivity $K$ itself, I'll instead assume that the field $q$ that we want to infer is the logarithm of the conductivity:

$$K = ke^q,$$

where $k$ is some real constant. The reason for this change of variables is to guarantee that the conductivity is positive, a necessary condition which can be challenging to enforce through other means. For our problem, we'll include some internal sources $f$. By way of boundary conditions, we'll assume that the solution is adjusts with some exchange coefficient $h$ to an external field $g$ (these are Robin boundary conditions). The weak form of this equation is

$$\begin{align} \langle F(u, q), v\rangle = & \int_\Omega\left(ke^q\nabla u\cdot\nabla v - fv\right)dx \\ & \qquad\qquad + \int_{\partial\Omega}h(u - g)v\, ds \end{align}$$

I'll assume that we know the sources, external field, and exchange coefficient accurately. The quantity that we want to minimize is the mean-square misfit of the solution $u$ with some observations $u^o$:

$$E(u) = \frac{1}{2}\int_\Omega\left(\frac{u - u^o}{\sigma}\right)^2dx,$$

where $\sigma$ is the standard deviation of the measurement errors in $u^o$. For realistic problems we might want to consider more robust measures of solution quality, like the 1-norm, but for demonstrative purposes the square norm is perfectly fine.

To make our problem as realistic as possible, we'll create a set of synthetic observations that's been polluted from the true value with random noise. The presence of noise introduces an additional challenge. The map from the parameters $q$ to the observations $u$ involves solving an elliptic PDE and thus tends to give an output field $u$ that is smoother than the input field $q$. (You can actually write down an analytical form of the linearization of this map that makes the smoothing property evident.) For many practical problems, however, the measurement errors are spatial white noise, which have equal power at all frequencies. If we put white noise through the inverse of a smoothing operator, we'll end up amplifying the high-frequency modes and the estimated field $q$ will be polluted with spurious osillations. To remove these unphysical features, we'll also include some metric of how oscillatory the inferred field is, which in our case will be

$$R(q) = \frac{1}{2}\int_\Omega|\nabla q|^2dx.$$

This is called the regularization functional. Depending on the problem you may want to use a different regularization functional, and at the end of this post I'll give an example of when you might want to do that.

All together now

The quantity we want to minimize is the functional

$$J(u, q) = E(u) + \alpha^2 R(q),$$

subject to the constraint that $u$ and $q$ are related by the PDE, which we'll write in abstract form as $F(u, q) = 0$. The parameter $\alpha$ is a length scale that determines how much we want to regularize the inferred field. Making a good choice of $\alpha$ is a bit of an art form best left for another day; in the following demonstration I'll pick a reasonable value and leave it at that. The adjoint method furnishes us with a way to calculate the derivative of $J$, which will be an essential ingredient in any minimization algorithm.

To be more explicit about enforcing those constraints, we can introduce a Lagrange multiplier $\lambda$. We would then seek a critical point of the Lagrangian

$$L(u, q, \lambda) = E(u) + \alpha^2 R(q) + \langle F(u, q), \lambda\rangle.$$

By first solving for $u$ and then for the adjoint state $\lambda$, we can effectively calculate the derivative of our original objective with respect to the parameters $q$. Under the hood, this is exactly what pyadjoint and (more generally) reverse-mode automatic differentiation does. The interface that pyadjoint presents to us hides the existence of a Lagrange multiplier and instead gives us only a reduced functional $\hat J(q)$.

Generating the exact data

First, we'll need to make a domain and some synthetic input data, which consist of:

  • the sources $f$
  • the external field $g$
  • the exchange coefficient $h$
  • the true log-conductivity field $q$

We have to be careful about what kind of data we use in order to make the problem interesting and instructive. Ideally, the the true log-conductivity field will give a solution that's very different from some kind of blunt, spatially constant initial guess. To do this, we'll first make the external field $g$ a random trigonometric polynomial.

import firedrake
mesh = firedrake.UnitSquareMesh(32, 32, diagonal='crossed')
Q = firedrake.FunctionSpace(mesh, family='CG', degree=2)
V = firedrake.FunctionSpace(mesh, family='CG', degree=2)
import numpy as np
from numpy import random, pi as π
x = firedrake.SpatialCoordinate(mesh)

rng = random.default_rng(seed=1)
def random_fourier_series(std_dev, num_modes, exponent):
    from firedrake import sin, cos
    A = std_dev * rng.standard_normal((num_modes, num_modes))
    B = std_dev * rng.standard_normal((num_modes, num_modes))
    return sum([(A[k, l] * sin(π * (k * x[0] + l * x[1])) +
                 B[k, l] * cos(π * (k * x[0] + l * x[1])))
                / (1 + (k**2 + l**2)**(exponent/2))
                for k in range(num_modes)
                for l in range(int(np.sqrt(num_modes**2 - k**2)))])
from firedrake import interpolate
g = interpolate(random_fourier_series(1.0, 6, 1), V)
import matplotlib.pyplot as plt
firedrake.trisurf(g);

Next, we'll make the medium much more insulating (lower conductivity) near the center of the domain. This part of the medium will tend to soak up any sources much more readily than the rest.

from firedrake import inner, min_value, max_value, Constant
a = -Constant(8.)
r = Constant(1/4)
ξ = Constant((0.4, 0.5))
q_true = interpolate(a * max_value(0, 1 - inner(x - ξ, x - ξ) / r**2), Q)
firedrake.trisurf(q_true);

In order to make the effect most pronounced, we'll stick a blob of sources right next to this insulating patch.

b = Constant(6.)
R = Constant(1/4)
η = Constant((0.7, 0.5))
f = interpolate(b * max_value(0, 1 - inner(x - η, x - η) / R**2), V)
firedrake.trisurf(f);

Once we pick a baseline value $k$ of the conductivity and the exchange coefficient $h$, we can compute the true solution. We'll take the exchange coefficient somewhat arbitrarily to be 10 in this unit system because it makes the results look nice enough.

from firedrake import exp, grad, dx, ds
k = Constant(1.)
h = Constant(10.)
u_true = firedrake.Function(V)
v = firedrake.TestFunction(V)
F = (
    (k * exp(q_true) * inner(grad(u_true), grad(v)) - f * v) * dx +
    h * (u_true - g) * v * ds
)
opts = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'lu',
        'pc_factor_mat_solver_type': 'mumps'
    }
}
firedrake.solve(F == 0, u_true, **opts)
firedrake.trisurf(u_true);

The true value of $u$ has a big hot spot in the insulating region, just as we expect.

Generating the observational data

For realistic problems, what we observe is the true solution plus some random noise $\xi$:

$$u_\text{obs} = u_\text{true} + \xi.$$

The ratio of the variance $\sigma$ of the noise to some scale of the solution, e.g. $\max_\Omega u_\text{true} - \min_\Omega u_\text{true}$, will determine the degree of accuracy that we can expect in the inferred field.

To make this experiment more realistic, we'll synthesize some observations by adding random noise to the true solution. We'll assume that the noise is spatially white, i.e. the covariance of the measurement errors is

$$\mathbb{E}[\xi(x)\xi(y)] = \sigma^2\delta(x - y)$$

where $\delta$ is the Dirac delta distribution. A naive approach would be to add a vector of normal random variables to the finite element expansion coefficients of the true solution, but this will fail for a subtle reason. Suppose that, at every point, the measurement errors $\xi$ are normal with mean 0 and variance $\sigma$. Letting $\mathbb{E}$ denote statistical expectation, we should then have by Fubini's theorem that

$$\mathbb{E}\left[\int_\Omega\xi(x)^2dx\right] = \int_\Omega\mathbb{E}[\xi(x)^2]dx = \sigma^2\cdot|\Omega|.$$

The naive approach to synthesizing the noise will give us the wrong value of the area-averaged variance.

ξ = firedrake.Function(V)
n = len(ξ.dat.data_ro)
ξ.dat.data[:] = rng.standard_normal(n)

firedrake.assemble(ξ**2 * dx)
0.6237269211354274

The "right" thing to do is:

  1. Compute the finite element mass matrix $M$
  2. Compute the Cholesky factorization $M = LL^*$
  3. Generate a standard normal random vector $z$
  4. The finite element expansion coefficients for the noise vector are
$$\hat\xi = \sigma\sqrt{\frac{|\Omega|}{n}}L^{-*}z.$$

You can show that this works out correctly by remembering that

$$\int_\Omega\xi^2dx = \hat\xi^*M\hat\xi.$$

We'll have to do a bit of hacking with PETSc data structures directly in order to pull out one of the Cholesky factors of the mass matrix.

from firedrake.petsc import PETSc
ϕ, ψ = firedrake.TrialFunction(V), firedrake.TestFunction(V)
m = inner(ϕ, ψ) * dx
M = firedrake.assemble(m, mat_type='aij').M.handle
ksp = PETSc.KSP().create()
ksp.setOperators(M)
ksp.setUp()
pc = ksp.pc
pc.setType(pc.Type.CHOLESKY)
pc.setFactorSolverType(PETSc.Mat.SolverType.PETSC)
pc.setFactorSetUpSolverType()
L = pc.getFactorMatrix()
pc.setUp()

Since our domain is the unit square, it has an area of 1, but for good measure I'll include this just to show the correct thing to do.

area = firedrake.assemble(Constant(1) * dx(mesh))
z = firedrake.Function(V)
z.dat.data[:] = rng.standard_normal(n)
with z.dat.vec_ro as Z:
    with ξ.dat.vec as Ξ:
        L.solveBackward(Z, Ξ)
        Ξ *= np.sqrt(area / n)

The error statistics are within spitting distance of the correct value of 1.

firedrake.assemble(ξ**2 * dx) / area
0.9898623684079142

The answer isn't exactly equal to one, but averaged over a large number of trials or with a larger mesh it will approach it. Finally, we can make the "observed" data. We'll use a signal-to-noise ratio of 50, but it's worth tweaking this value and seeing how the inferred parameters change.

 = u_true.dat.data_ro[:]
signal = .max() - .min()
signal_to_noise = 50
σ = firedrake.Constant(signal / signal_to_noise)

u_obs = u_true.copy(deepcopy=True)
u_obs += σ * ξ

The high-frequency noise you can see in the plot below is exactly what makes regularization necessary.

firedrake.trisurf(u_obs);

Calculating derivatives

Now we can import firedrake-adjoint. Under the hood, this will initialize the right data structures to calculate derivatives using the adjoint method, and we can even take a peek at those data structures.

import firedrake_adjoint

We'll start with a fairly neutral initial guess that the log-conductivity $q$ is identically 0.

q = firedrake.Function(Q)
u = firedrake.Function(V)
F = (
    (k * exp(q) * inner(grad(u), grad(v)) - f * v) * dx +
    h * (u - g) * v * ds
)
firedrake.solve(F == 0, u, **opts)

The computed solution with a constant conductivity doesn't have the gigantic spike in the insulating region, so it's very easy to tell them apart. When the differences are really obvious it makes it easier to benchmark a putative solution procedure.

firedrake.trisurf(u);

Just to give a sense of how different the initial value of the observed field is from the true value, we can calculate the relative difference in the 2-norm:

print(firedrake.norm(u - u_true) / firedrake.norm(u_true))
0.29858594938745037

Now we can start having some fun with Firedrake's adjoint capabilities. A lot of what we're going to do can seem like magic and I often find it a little bewildering to have no idea what's going on under the hood. Much of this machinery works by overloading functionality within Firedrake and recording operations to a tape. The tape can then in effect be played backwards to perform reverse-mode automatic differentiation. You can access the tape explicitly from the Firedrake adjoint API, which conveniently provides functions to visualise the tape using graphviz or NetworkX. The plot below shows the overall connectivity of the structure of the tape; you can query the nodes using NetworkX to get a better idea of what each one represents. This tape will grow and grow as we calculate more things and it's a common failure mode for an adjoint calculation to eat up all the system memory if you're not careful.

import networkx
tape = firedrake_adjoint.get_working_tape()
graph = tape.create_graph(backend='networkx')
fig, axes = plt.subplots()
networkx.draw_kamada_kawai(graph, ax=axes);

Hopefully this gives you some sense of how all this machinery works at a lower level. For more details you can see the dolfin-adjoint documentation, which has loads of commentary on both the math and the code by its author, Patrick Farrell.

To start on solving the inverse problem, we're going to declare that $q$ is the control variable, i.e. it's the thing that want to optimize over, as opposed to the field $u$ that we can observe.

 = firedrake_adjoint.Control(q)

Next we'll create the objective functional, which measures both the degree to which our computed solution $u$ differs from the true solution and the oscillations in our guess $q$. Normally, we might create a symbolic variable (a Firedrake Form type) that represents this functional. If we wanted to get an actual number out of this symbolic object, we would then call assemble. So it might stick out as unusual that we're assembling the form right away here.

α = Constant(5e-2)
J = firedrake.assemble(
    0.5 * ((u - u_obs) / σ)**2 * dx +
    0.5 * α**2 * inner(grad(q), grad(q)) * dx
)

In fact there's a bit of magic going under the hood; J isn't really a floating point number, but a more complex object defined within the pyadjoint package. The provenance of how this number is calculated is tracked through the adjoint tape.

print(type(J))
<class 'pyadjoint.adjfloat.AdjFloat'>

We can get an actual number out of this object by casting it to a float.

print(float(J))
4.975344614983414

The advantage of having this extra layer of indirection is that, as the control variable $q$ changes, so does $J$ and firedrake-adjoint will track the sensitivity under the hood for you. The next step is to somehow wire up this functional with the information that $u$ isn't really an independent variable, but rather a function of the control $q$. This is what the ReducedFunctional class does for us.

 = firedrake_adjoint.ReducedFunctional(J, )

The reduced functional has a method to calculate its derivative with respect to the control variable.

dĴ_dq = .derivative()

At this point, it might be tempting to start using this method as the basis for an optimization algorithm. But there's a subtle and important point missing that I think is worth dwelling on a bit. To immediately show you that something's wrong, let's plot the derivative of the reduced functional.

fig, axes = plt.subplots()
contours = firedrake.tricontourf(dĴ_dq, 40, axes=axes)
fig.colorbar(contours);

You can see a very obvious stippling pattern in this plot which happens to align with the mesh. A simplistic explanation for this phenomenon is that we're using 2nd-order finite elements to describe the inferred field $q$. The edge degrees of freedom are not connected to as many others as are the vertex degrees of freedom. Things might be even worse if we were using, say, cubic elements, which have purely internal degrees of freedom. What gives?

A better explanation requires us to revisit some of the underlying mathematics. The reduced functional $\hat J$ is a differentiable mapping of the function space $Q$ into the real numbers. The derivative $d\hat J/dq$ at a particular value of the control variable is an element of the dual space $Q^*$. As mathematicians, we grow accustomed to thinking of Hilbert spaces as being isometric to their duals. It's easy to forget that isometric does not mean identical; the mapping between primal and dual can be non-trivial. For example, suppose $Q$ is the Sobolev space $H^1(\Omega)$. The dual space $H^{-1}(\Omega)$ is isometric to the primal, but to evaluate the mapping between them, we have to solve an elliptic PDE.

The Sobolev space $H^1(\Omega)$ is a relatively tame one in the grand scheme of things. Real problems might involve controls in Banach spaces with no inner product structure at all. For example, the conductivity coefficient has to be bounded and positive, so we're probably looking in some cone in the space $L^\infty(\Omega)$. In general, conductivity fields can be discontinuous, although not wildly so. We might then want to look in the intersection of $L^\infty$ with the space $BV(\Omega)$ of functions whose first derivatives are finite signed measures.

Nonetheless, the discretization via finite elements can obscure the distinction between the primal and dual spaces. The control $q$ and the derivative $d\hat J/dq$ contain within them a wad of data that happens to look the same: an array of floating point numbers, the size of which is equal to the number of vertices + the number of edges of the mesh for our P2 discretization. What's confusing is that these numbers don't mean the same thing. The array living under $q$ represents its coefficients in the finite element basis for the space $Q$, while the array for $d\hat J/dq$ represents its coefficients in the dual basis. To get the action of $d\hat J/dq$ on some perturbation field $\phi$, we take the (Euclidean) dot product of the wads of data living underneath them. This is in distinct contrast to getting the inner product in, say, $L^2(\Omega)$ of $\phi$ with another function $\psi$, where the inner product is instead calculated using the finite element mass matrix.

So, where does that leave us? We need some way of mapping the dual space $Q^*$ back to the primal. This mapping is referred to in the literature as the Riesz map after the Riesz representation theorem. The laziest way we could possibly do so is to multiply $d\hat J/dq$ by the inverse of the finite element mass matrix. Really we should include a 2nd-order elliptic operator too, since we've assumed that the controls live in an $H^1$-conforming space, but for illustrative purposes the mass matrix will do fine.

ϕ, ψ = firedrake.TrialFunction(Q), firedrake.TestFunction(Q)
M = firedrake.assemble(ϕ * ψ * dx)
solver = firedrake.LinearSolver(M, **opts)
z = firedrake.Function(Q)
solver.solve(z, dĴ_dq)

Applying the Riesz map gives us something more reasonable-looking, free of the stippling artifacts we saw in the last plot.

fig, axes = plt.subplots()
contours = firedrake.tricontourf(z, 40, axes=axes)
fig.colorbar(contours);

Keeping track of which quantities live in the primal space and which live in the dual space is one of the challenging parts of solving PDE-constrained optimization problems. Most publications on numerical optimization assume the problem is posed over Euclidean space. In that setting, there's effectively no distinction between primal and dual. You can see this bias reflected in software packages that purport to solve numerical optimization problems; almost none of them have support for supplying a matrix other than the identity that defines the dual pairing. The fact that a Sobolev space isn't identical to its dual has some unsettling consequences. For starters, the venerable gradient descent method isn't even well-posed over Sobolev spaces. If you can rely on the built-in optimization routines from pyadjoint, you'll largely be insulated from this problem. But if you've read this far there's a good chance that you'll have to roll your own solvers at some point in your life. To paraphrase the warning at gate of Plato's academy, let none ignorant of duality enter there.

Solving the inverse problem

Ok, screed over. Let's do something useful now. The firedrake-adjoint package contains several routines to minimize the reduced objective functional and for our particular problem, these built-in routines will work just fine. Let's see how well we can recover the log-conductivity field.

q_opt = firedrake_adjoint.minimize()
firedrake.trisurf(q_opt);

The optimization procedure has correctly identified the drop in the conductivity of the medium to within our smoothness constraints. Nonetheless, it's clear in the eyeball norm that the inferred field doesn't completely match the true one.

firedrake.norm(q_opt - q_true) / firedrake.norm(q_true)
0.2863633696322895

What's a little shocking is the degree to which the computed state matches observations despite these departures. If we plot the computed $u$, it looks very similar to the true value.

q.assign(q_opt)
firedrake.solve(F == 0, u, **opts)
firedrake.trisurf(u);

Moreover, if we compute the model-data misfit and weight it by the standard deviation of the measurement errors, we get a value that's roughly around 1/2.

firedrake.assemble(0.5 * ((u - u_obs) / σ)**2 * dx)
0.5139507000594687

This value is about what we would expect from statistical estimation theory. Assuming $u$ is an unbiased estimator for the true value of the observable state, the quantity $((u - u^o) / \sigma)^2$ is a $\chi^2$ random variable. When we integrate over the whole domain and divide by the area (in this case 1), we're effectively summing over independent $\chi^2$ variables and so we should get a value around 1/2.

Recall that we used a measurement error $\sigma$ that was about 2\% of the true signal, which is pretty small. You can have an awfully good signal-to-noise ratio and yet only be able to infer the conductivity field to within a relative error of 1/4. These kinds of synthetic experiments are really invaluable for getting some perspective on how good of a result you can expect.

Total variation regularization

In the previous post, I showed how to solve inverse problems for coefficients of elliptic PDE using firedrake-adjoint. The exact parameter field that I used in that demonstration was smooth in space and, to guarantee a smooth solution, I showed how to add regularization to the objective functional. Many geophysical inverse problems aim to estimate fields that instead have sharp discontinuities or interfaces. For example, the porosity of soil and hard bedrock are very different and there is no continuous transition between the two. For these media, the regularization functional

$$R(q) = \frac{1}{2}\int_\Omega|\nabla q|^2 dx$$

that we used in that demonstration would yield an infinite value. The inferred field with this penalty would have a more diffuse interface than the real one.

Rather than use the integrated square gradient, we can instead use the total variation functional:

$$R(q) = \int_\Omega|\nabla q|dx.$$

We can get some insight into why the total variation is a good regularizer for these types of problems by using the very wonderful coarea formula. The coarea formula states that, for reasonable functions $p$ and $q$, we can express certain integrals involving the gradient of $q$ in terms of its contours or level sets. Let $ds$ be the element of surface area, let $z$ be an arbitrary real value, and let $\Gamma_z$ be the $z$-contour surface of the function $q$. Then

$$\int_\Omega p|\nabla q|dx = \int_{-\infty}^\infty\int_{\Gamma_z}p\, ds\, dz.$$

The right-hand side of the last equation can make sense even when $q$ is discontinuous, provided we're a little careful in the definition of the $z$-contour of $q$:

$$\Gamma_z = \partial\{x \in \Omega: q(x) \le z\}.$$

For example, suppose that $\Gamma$ is some nice closed surface inside $\Omega$, and we take $q$ to be equal to $\alpha$ in the interior of $\Gamma$ and $0$ outside. Then the coarea formula tells us that

$$\int_\Omega|\nabla q|dx = a\cdot|\Gamma|.$$

This partly explains why the total variation functional is an effective regularizer. While it doesn't forbid a jump discontinuity as such, it instead penalizes (1) the magnitude of the jump and (2) the area of the surface over which it occurs. Gabriel Peyré has a nice visualization of the coarea formula on Twitter.

Calculating total variation

A new difficulty that we'll encounter here is that the total variation functional doesn't have a well-defined functional derivative like the mean square gradient does. It is a convex functional, so the minimum is well-defined, but we might be at a loss for an algorithm to actually approximate it.

We've already encountered the mathematical concepts that we'll need to remedy this issue in a previous post on the obstacle problem. The obstacle problem is the prototypical example of an optimization problem with inequality constraints. To solve it, we reformulated the obstacle problem as an unconstrained convex optimization problem where the objective could take the value $+\infty$. We then smoothed away the infinite values by instead working with the Moreau envelope of that functional.

Many of the same tricks work for total variation-type problems because the Moreau envelope of the $L^1$-norm has a simple analytical expression in terms of the Huber function:

$$H_\gamma(z) = \begin{cases}\frac{1}{2\gamma}|z|^2 & |z| < \gamma \\ |z| - \frac{\gamma}{2} & |z| \ge \gamma \end{cases}$$

The Huber function looks like the $L^1$ norm for large values of the argument, but like the square norm for small values.

import numpy as np
zs = np.linspace(-5., 5., 41)
γ = 2.
H_γ = [z**2 / (2 * γ) if abs(z) < γ else abs(z) - γ / 2 for z in zs]
import matplotlib.pyplot as plt
plt.plot(zs, H_γ);

The Moreau envelope of the 1-norm can be expressed through the Huber function. Suppose that $z$ and $w$ are vector fields in the space $L^2(\Omega)$; then

$$\inf_w\left(\int_\Omega|w|dx + \frac{1}{2\gamma}\int_\Omega|z - w|^2dx\right) = \int_\Omega H_\gamma(z)dx.$$

This Huber functional does have a well-defined functional derivative for positive values of the penalty parameter $\gamma$, so we can reuse our old gradient-based optimization routines. As with any penalty-type method, however, the problem becomes more ill-conditioned as we decrease $\gamma$. There's one more critical fact we'll need. We know how to calculate the Moreau envelope of the 1-norm, but our regularization functional is instead a scalar multiple of the 1-norm. If we denote the envelope of a functional $R$ by $\text{env}_\gamma R$, then

$$\text{env}_\gamma\, \alpha\cdot R = \alpha\cdot\text{env}_{\gamma\alpha}R.$$

With this identity in hand, we can instead try to minimize the approximate objective functional

$$J_\gamma(u, q) = \frac{1}{2}\int_\Omega\left(\frac{u - u^o}{\sigma}\right)^2dx + \alpha\int_\Omega H_{\alpha\gamma}\left(\nabla q\right)\,dx.$$

Letting $F$ be the weak form of the Poisson equation, our constraint is that

$$\langle F(u, q), v\rangle = 0$$

for all test functions $v$. Recall that the scalar parameter $\alpha$ that dictates the degree to which we're regularizing the problem, and has units of $[q]^{-1}\cdot[x]$. When we were using the mean square gradient to regularize the problem, this quantity was raised to the power 2 to make the units work out correctly. Here the exponent is 1 instead. Moreover, we can make an educated guess for what $\alpha$ might be if we know roughly the numerical range of the field we're inferring and the diameter of the domain.

Generating the exact data

We'll proceed much like we did in the last post, only the conductivity field will have a sharp interface such as one might find between two distinct media. To make things a little easier later, we'll actually use a continuous Galerkin basis, in which case the interface will extend over a single grid cell. This is a little bit sinful and really we should be using a DG basis. But that would involve a bunch of annoying facet integrals that distract from the real point. We can just as easily illustrate the essential idea using continuous basis functions.

import firedrake
mesh = firedrake.UnitSquareMesh(32, 32, diagonal='crossed')
Q = firedrake.FunctionSpace(mesh, family='CG', degree=2)
V = firedrake.FunctionSpace(mesh, family='CG', degree=2)
from numpy import random, pi as π
x = firedrake.SpatialCoordinate(mesh)

rng = random.default_rng(seed=1)
def random_fourier_series(std_dev, num_modes, exponent):
    from firedrake import sin, cos
    A = std_dev * rng.standard_normal((num_modes, num_modes))
    B = std_dev * rng.standard_normal((num_modes, num_modes))
    return sum([(A[k, l] * sin(π * (k * x[0] + l * x[1])) +
                 B[k, l] * cos(π * (k * x[0] + l * x[1])))
                / (1 + (k**2 + l**2)**(exponent/2))
                for k in range(num_modes)
                for l in range(int(np.sqrt(num_modes**2 - k**2)))])
from firedrake import interpolate, project
g = interpolate(random_fourier_series(1.0, 6, 1), V)
from firedrake import inner, max_value, conditional, Constant
a = -Constant(4.5)
r = Constant(1/4)
ξ = Constant((0.4, 0.5))
q_true = interpolate(a * conditional(inner(x - ξ, x - ξ) < r**2, 1, 0), Q)
firedrake.trisurf(q_true);
b = Constant(6.)
R = Constant(1/4)
η = Constant((0.7, 0.5))
f = interpolate(b * max_value(0, 1 - inner(x - η, x - η) / R**2), V)
from firedrake import exp, grad, dx, ds
k = Constant(1.)
h = Constant(10.)
u_true = firedrake.Function(V)
v = firedrake.TestFunction(V)
F = (
    (k * exp(q_true) * inner(grad(u_true), grad(v)) - f * v) * dx +
    h * (u_true - g) * v * ds
)
opts = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'lu',
        'pc_factor_mat_solver_type': 'mumps'
    }
}
firedrake.solve(F == 0, u_true, **opts)

The true solution shares many properties with that of the previous demo, namely the sharp spike in the middle of the domain where the medium becomes more insulating. An interesting feature you can see here is how there's a break in slope across the discontinuity. This is a general feature of sharp interface problems; the flux is discontinuous, even though the gradient of the solution is not.

firedrake.trisurf(u_true);

Generating the observational data

To create the synthetic observations, we'll proceed along the same lines as in the last post. Recall that these incantations were necessary because generating a random field with the correct error statistics using a finite element basis does all sorts of weird unintuitive things.

ξ = firedrake.Function(V)
n = len(ξ.dat.data_ro)
ξ.dat.data[:] = rng.standard_normal(n)
from firedrake import assemble
from firedrake.petsc import PETSc
ϕ, ψ = firedrake.TrialFunction(V), firedrake.TestFunction(V)
m = inner(ϕ, ψ) * dx
M = assemble(m, mat_type='aij').M.handle
ksp = PETSc.KSP().create()
ksp.setOperators(M)
ksp.setUp()
pc = ksp.pc
pc.setType(pc.Type.CHOLESKY)
pc.setFactorSolverType(PETSc.Mat.SolverType.PETSC)
pc.setFactorSetUpSolverType()
L = pc.getFactorMatrix()
pc.setUp()
area = assemble(Constant(1) * dx(mesh))
z = firedrake.Function(V)
z.dat.data[:] = rng.standard_normal(n)
with z.dat.vec_ro as Z:
    with ξ.dat.vec as Ξ:
        L.solveBackward(Z, Ξ)
        Ξ *= np.sqrt(area / n)
 = u_true.dat.data_ro[:]
signal = .max() - .min()
signal_to_noise = 50
σ = firedrake.Constant(signal / signal_to_noise)

u_obs = u_true.copy(deepcopy=True)
u_obs += σ * ξ

Solution via Moreau envelopes

We will again use the blunt initial guess that $q = 0$ everywhere.

import firedrake_adjoint
q = firedrake.Function(Q)
u = firedrake.Function(V)
F = (
    (k * exp(q) * inner(grad(u), grad(v)) - f * v) * dx +
    h * (u - g) * v * ds
)
firedrake.solve(F == 0, u, **opts)

Once again, the initial computed solution lacks the large spike in the insulating part of the medium.

firedrake.trisurf(u);

The Huber functional is easy to express in UFL. Before writing this, I tried to find a few different ways to express the Huber functional in a way that might be more amenable to symbolic differentiation because I thought that the conditional would prove to be a huge problem. None of those worked out mathematically, but to my surprise, it seems as if Firedrake can still calculate functional derivatives of conditional expressions just fine.

from firedrake import sqrt
def huber(v, γ):
    return firedrake.conditional(
        inner(v, v) < γ**2,
        inner(v, v) / (2 * γ),
        sqrt(inner(v, v)) - γ / 2
    )

We'll use the same value $\alpha$ for the smoothing length as in the previous demo, noting again that it's to the power of 1 instead of 2 this time. But we don't have much in the way of a priori guidance for how to pick the Moreau envelope parameter $\gamma$, which should be dimensionless. I arrived at the following value by trial and error.

α = Constant(5e-2)
γ = Constant(4e2)

J = assemble(
    0.5 * ((u - u_obs) / σ)**2 * dx +
    α * huber(grad(q), α * γ) * dx
)

The procedure to compute the solution works just the same as before.

 = firedrake_adjoint.Control(q)
 = firedrake_adjoint.ReducedFunctional(J, )
q_γ = firedrake_adjoint.minimize()

The resulting approximation does a fairly good job capturing the sharp gradients in the inferred field around part of the interface. On the upper left side, the jump has been blurred out, which is to be expected based on the form of the Huber approximation.

firedrake.tricontourf(q_γ, 40)
<matplotlib.tri.tricontour.TriContourSet at 0x7fd35f05ab80>

Finally, let's see how well our approximation matches the true value:

assemble(abs(q_true - q_γ) * dx) / assemble(abs(q_true) * dx)
0.4745209247267914

Next let's see what happens when we use a much smaller value of the envelope parameter. If you run this code yourself, you can observe first-hand how the problem gets much more ill-conditioned as you decrease $\gamma$ by how much longer it takes to get a solution!

Γ = Constant(10)

J = assemble(
    0.5 * ((u - u_obs) / σ)**2 * dx +
    α * huber(grad(q), α * Γ) * dx
)

 = firedrake_adjoint.Control(q)
 = firedrake_adjoint.ReducedFunctional(J, )
q_Γ = firedrake_adjoint.minimize()

The resulting solution gives a substantially better fit to the true parameters.

assemble(abs(q_true - q_Γ) * dx) / assemble(abs(q_true) * dx)
0.2600467478473466

Moreover, the interface is noticeably much sharper when we decrease $\gamma$, as expected.

firedrake.tricontourf(q_Γ, 40)
<matplotlib.tri.tricontour.TriContourSet at 0x7fd35e560f70>

At a few points, the mesh triangles have made themselves apparent. To fix this, we would want to either adapt the mesh to the contours of $q$, or do things the blunt way and uniformly refine until this effect was no longer obvious.

Discussion

Regularization of inverse problems is a subtle topic. When the field we're inferring has sharp discontinuities, the total variation functional can be a better regularizer than the mean square gradient. There's a statistical way to make that argument more precise: TV better represents the prior information that we're claiming to have about our solution. Using it incurs a cost in implementation complexity, however, because the TV functional is non-smooth. As with the obstacle problem, the Moreau envelope provides us with a way to solve non-smooth optimization problems using tools that were designed only for smooth ones.

We've examined what regularization functional to use here, but we haven't examined the topic of deciding how much to regularize. There are several procedures for choosing a value of $\alpha$, some merely heuristic and others based on deeper insights from Bayesian statistical inference. In the statistical literature this is often referred to as hyperparameter optimization and it's arguably just as important as deciding which regularization functional to use in the first place.

The shallow water equations

In the previous post, we explored how difficult it is to solve the simplest hyperbolic conservation law, the scalar advection equation. To solve this PDE accurately, we had to understand how the partly arbitrary choice of a numerical flux can make or break the stability of the spatial discretization, how low order schemes are very diffusive, and how higher-order explicit schemes introduce spurious maxima and minima that we can only control through a nonlinear flux limiting procedure. The scalar advection equation is comparatively simple in that signals can only propagate along the predetermined velocity field. In this post, we'll look at something more realistic and much more difficult: the shallow water equations. The shallow water equations are a system of equations rather than a scalar problem, and as such they can exhibit non-trivial wave propagation in a way that the advection equation can't. They're a great model for testing numerical solvers because they're both simple enough to keep in your head all at once, and yet at the same time they exhibit many of the complexities of more "serious" models -- nonlinearity, non-trivial conserved quantities, mimetic properties, the list goes on.

The shallow water equations can be derived from the incompressible Euler equations of fluid dynamics with a free surface under the assumption that the horizontal length scale is much longer than the vertical one. This approximation reduces the unknowns to the thickness $h$ of the fluid and the depth-averaged velocity $u$. The conservation laws are for mass and momentum:

$$\begin{align} \frac{\partial}{\partial t}h + \nabla\cdot hu & = 0 \\ \frac{\partial}{\partial t}hu + \nabla\cdot\left(hu\otimes u + \frac{1}{2}gh^2I\right) & = -gh\nabla b \end{align}$$

where $g$ is the acceleration due to gravity, $b$ is the bathymetry, and $I$ is the identity matrix. This problem is a little more complicated because of the time derivative on $h\cdot u$, a combination of two of the state variables. To make things a little easier, we'll instead work with the momentum $q = h\cdot u$ and rewrite the system as

$$\begin{align} \frac{\partial}{\partial t}h + \nabla\cdot q & = 0 \\ \frac{\partial}{\partial t}q + \nabla\cdot\left(h^{-1}q\otimes q + \frac{1}{2}gh^2I\right) & = -gh\nabla b. \end{align}$$

As in the previous post, we'll use a discontinuous Galerkin basis. We showed that there is more than one way to come up with a discretized problem that is consistent with the idealized one and this is manifested in which numerical flux to use. Things get much more interesting for systems of PDE, which can have more than one characteristic speed besides that of the background flow field. In the following, I'll assume you're familiar with the fact that the characteristic wave speed for the shallow water equations is

$$c = |u| + \sqrt{gh}.$$

The fact that the wave speed now depends on the solution and that waves propagate in all directions instead of only along a pre-set vector field has several consequences. First, we can't pick a CFL-stable timestep from the outset because the fluid velocity and thickness could increase well beyond their initial values. The only options for timestepping are to use an adaptive procedure or a whole mess of trial and error. Second, we have to think harder about numerical fluxes. For scalar equations, we can use the numerical flux to mimic the properties of upwind finite difference schemes, but for systems this reasoning doesn't work -- there can be waves simultaneously propagating in both normal directions at a given internal facet.

Spatial discretization

We'll examine several different test problems for the shallow water equations, so to avoid a huge amount of repeated code we'll first write a few Python functions to set up the weak form of the problem.

import firedrake
from firedrake import Constant
g = Constant(9.81)
I = firedrake.Identity(2)

The component of the fluxes in the interior of the cells is exactly what you get from applying the divergence theorem to the original system.

from firedrake import inner, grad, dx
def cell_flux(z):
    Z = z.function_space()
    h, q = firedrake.split(z)
    ϕ, v = firedrake.TestFunctions(Z)
    
    f_h = -inner(q, grad(ϕ)) * dx

    F = outer(q, q) / h + 0.5 * g * h**2 * I
    f_q = -inner(F, grad(v)) * dx

    return f_h + f_q

Now we need to add the facet terms and this is where the numerical fluxes come in. We'll start with a function to compute the central flux. This is the easy part -- there are no choices to make.

from firedrake import avg, outer, dS
def central_facet_flux(z):
    Z = z.function_space()
    h, q = firedrake.split(z)
    ϕ, v = firedrake.TestFunctions(Z)
    
    mesh = z.ufl_domain()
    n = firedrake.FacetNormal(mesh)

    f_h = inner(avg(q), ϕ('+') * n('+') + ϕ('-') * n('-')) * dS

    F = outer(q, q) / h + 0.5 * g * h**2 * I
    f_q = inner(avg(F), outer(v('+'), n('+')) + outer(v('-'), n('-'))) * dS
    
    return f_h + f_q

The central flux by itself is unstable with explicit timestepping schemes and to reocver stability we need an extra diffusive flux term. The subtle choices are all in this diffusive flux. For the remainder of this demo, we'll use the Lax-Friedrichs numerical flux. This flux uses the maximum outgoing wave speed to set the local diffusion coefficient. An upper bound for the outgoing wavespeed is then $|c| = |q/h \cdot n| + \sqrt{gh}$. I say an upper bound and not the actual maximum value because the system could exhibit the shallow water equivalent of supersonic flow -- where the speed $|u|$ exceeds $\sqrt{gh}$, both waves could be moving in the same direction rather than opposite each other.

The vast majority of the literature you'll find on DG methods uses the Lax-Friedrichs flux. For example, Cockburn and Shu (1998) in the last of their famous series of papers on the Runge-Kutta DG method consider only the Lax-Friedrichs flux and neglect to even mention the alternatives. This can be confusing for beginners to the subject because it isn't clear at what points in the process you (the programmer) have choices to make and where you don't. Part of the reason for this singular focus is that the Lax-Friedrichs flux is the simplest to implement, but several studies have found -- at least for problems without shock waves -- that other flux functions offer negligible gains in accuracy; see for example Qiu (2006) and Bernard et al. (2009). In that case, there isn't much value in using a different flux function that may be more expensive to compute and make the code harder to understand and maintain. The choice of numerical flux is more consequential for problems with shock waves or for more complex systems (see for example Beck et al. (2014), which studied turbulent flow).

from firedrake import sqrt, max_value
def lax_friedrichs_facet_flux(z):
    Z = z.function_space()
    h, q = firedrake.split(z)
    ϕ, v = firedrake.TestFunctions(Z)
    
    mesh = h.ufl_domain()
    n = firedrake.FacetNormal(mesh)
    
    c = abs(inner(q / h, n)) + sqrt(g * h)
    α = avg(c)
    
    f_h = -α * (h('+') - h('-')) * (ϕ('+') - ϕ('-')) * dS
    f_q = -α * inner(q('+') - q('-'), v('+') - v('-')) * dS

    return f_h + f_q

Finally, we'll do a few experiments with variable bottom topography as well.

def topographic_forcing(z, b):
    Z = z.function_space()
    h = firedrake.split(z)[0]
    v = firedrake.TestFunctions(Z)[1]

    return -g * h * inner(grad(b), v) * dx

Time discretization

We'll consider two different timestepping methods: the forward Euler scheme and the strong stability-preserving Runge-Kutta method of order 3, or SSPRK3 for short. Since we'll be using these quite a bit we'll factor them out into classes that store the needed internal state.

In the previous demo on the conservative advection equation, we used solver parameters that specified a block ILU preconditioner for the linear system. One application of this preconditioner gives an exact solution to the linear system because the mass matrix for DG discretizations is block diagonal. We're doing something a little different here by using a mixed function space for the thickness and momentum because it makes the time discretization much easier. But as a consequence the mass matrix that Firedrake will build under the hood uses a nested storage format -- probably compressed row storage for the thicknesses and block CRS for the momentum, but this shouldn't matter. In order to achieve the same exact linear solve effect here that we had for the conservative advection equation, we'll specify an outer-level fieldsplit preconditioner. Fieldsplit lets us separately specify preconditioners for each block, and those inner preconditioners will be our good old ILU + block Jacobi. You're likely to encounter fieldsplit preconditioners again if you ever have solved mixed finite element problems like the Stokes equations.

block_parameters = {
    'ksp_type': 'preonly',
    'pc_type': 'ilu',
    'sub_pc_type': 'bjacobi'
}

parameters = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'fieldsplit',
        'fieldsplit_0': block_parameters,
        'fieldsplit_1': block_parameters
    }
}

I've defined a time integrator class below that provides the bare minimum amount of functionality to do what we need. The data stored in this class consists of the current value of the state variables, an auxiliary state for the value at the next timestep, and a cached solver object for the mass matrix solve. The step method lets us advance the simulation by a single timestep which we can change on one invocation to the next. In principle we could do adaptive timestepping with this implementation.

from firedrake import (
    NonlinearVariationalProblem as Problem,
    NonlinearVariationalSolver as Solver
)

class ForwardEuler:
    def __init__(self, state, equation):
        z = state.copy(deepcopy=True)
        F = equation(z)
        
        z_n = z.copy(deepcopy=True)
        Z = z.function_space()
        w = firedrake.TestFunction(Z)
        
        dt = firedrake.Constant(1.)

        problem = Problem(inner(z_n - z, w) * dx - dt * F, z_n)
        solver = Solver(problem, **parameters)
        
        self.state = z
        self.next_state = z_n
        self.timestep = dt
        self.solver = solver
        
    def step(self, timestep):
        self.timestep.assign(timestep)
        self.solver.solve()
        self.state.assign(self.next_state)

Demonstration

Our first test problem will be a periodic domain with a flat bottom. The initial state of the system will consist of a spherical perturbation to the water thickness, and we'll look at how this disturbance evolves in time.

nx, ny = 32, 32
Lx, Ly = 20., 20.
mesh = firedrake.PeriodicRectangleMesh(
    nx, ny, Lx, Ly, diagonal='crossed'
)

x = firedrake.SpatialCoordinate(mesh)
lx = 5.
y = Constant((lx, lx))
r = Constant(2.5)

h_0 = Constant(1.)
δh = Constant(1/16)
h_expr = h_0 + δh * max_value(0, 1 - inner(x - y, x - y) / r**2)
DG(0) basis; or, the finite volume method

We'll start by using the simplest discretization possible: piecewise constant basis functions for both the thickness and momentum. This method is identical to the lowest-order finite volume method. We'll also use a mixed function space $Z = Q \times V$ that includes both the thickness and momentum. This choice isn't strictly necessary but it makes it that much easier to write the time integrator you saw above. The code is equivalent to a single update for the combined state vector $z = (h, q)$ rather than two separate updates for each.

Q0 = firedrake.FunctionSpace(mesh, family='DG', degree=0)
V0 = firedrake.VectorFunctionSpace(mesh, family='DG', degree=0)
Z0 = Q0 * V0

The split method of functions in mixed spaces give us the tuple of components. That way we can initialize the thickness to the expression defined just above. Note that the method split of functions in mixed spaces is different from the module-level function split, which gives us symbolic objects.

z0 = firedrake.Function(Z0)
h0, q0 = z0.split()
h0.project(h_expr);
solver = ForwardEuler(
    z0,
    lambda z: (
        cell_flux(z) +
        central_facet_flux(z) +
        lax_friedrichs_facet_flux(z)
    )
)

Since we'll be running the same simulation many times, we'll wrap up the loop in a function.

import tqdm

def run_simulation(solver, final_time, num_steps, output_freq):
    hs, qs = [], []
    qs = []
    pbar = tqdm.trange(num_steps)
    for step in pbar:
        if step % output_freq == 0:
            h, q = solver.state.split()
            hmin, hmax = h.dat.data_ro.min(), h.dat.data_ro.max()
            pbar.set_description(f'{hmin:5.3f}, {hmax:5.3f}')
            hs.append(h.copy(deepcopy=True))
            qs.append(q.copy(deepcopy=True))

        solver.step(timestep)
    
    return hs, qs
final_time = 10.
timestep = 5e-3
num_steps = int(final_time / timestep)
output_freq = 10
hs, qs = run_simulation(solver, final_time, num_steps, output_freq)
1.000, 1.003: 100%|██████████| 2000/2000 [00:08<00:00, 246.23it/s]
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

def make_animation(hs, b, timestep, output_freq, **kwargs):
    fig, axes = plt.subplots()
    axes.set_aspect('equal')
    axes.set_xlim((0.0, Lx))
    axes.set_ylim((0.0, Ly))
    axes.get_xaxis().set_visible(False)
    axes.get_yaxis().set_visible(False)
    η = firedrake.project(hs[0] + b, hs[0].function_space())
    colors = firedrake.tripcolor(
        hs[0], num_sample_points=1, axes=axes, **kwargs
    )
    
    def animate(h):
        η.project(h + b)
        colors.set_array(η.dat.data_ro[:])

    interval = 1e3 * output_freq * timestep
    animation = FuncAnimation(fig, animate, frames=hs, interval=interval)
    
    plt.close(fig)
    return HTML(animation.to_html5_video())
make_animation(
    hs, Constant(0), timestep, output_freq, vmin=0.98, vmax=1.03
)

We get the expected propagation at first, but the method is so diffusive that the waves quickly get washed out. Another sanity check that we'll repeat through the following is to track the total energy of the system. The shallow water equations conserve the quantity

$$E = \frac{1}{2}\int_\Omega\left(\frac{|q|^2}{h} + g(h + b)^2\right)dx.$$

(There's a Hamiltonian formulation of the shallow water equations based on this energy functional, although the Poisson bracket is a little weird.) Approximately conserving the total energy is especially important for long-running simulations of systems where physical dissipation mechanisms are relatively weak or non-existent. The plot below shows that the explicit Euler scheme with DG(0) basis functions and the Lax-Friedrichs flux dissipates energy.

energies = [
    firedrake.assemble(
        0.5 * (inner(q, q) / h + g * h**2) * dx
    )
    for h, q in zip(hs, qs)
]
fig, axes = plt.subplots()
axes.plot(energies);
With topography

What happens if we add a bump at the bed? We'll use a similar perturbation to the initial thickness.

y = Constant((3 * lx, 3 * lx))
δb = Constant(1/4)
b = δb * max_value(0, 1 - inner(x - y, x - y) / r**2)

In order to make the initial state have a flat surface, we'll subtract off the bottom topography from the initial thickness.

z0 = firedrake.Function(Z0)
z0.sub(0).project(h_expr - b);

The only thing different here is the presence of the topographic forcing term.

def F(z):
    return (
        cell_flux(z) +
        central_facet_flux(z) +
        lax_friedrichs_facet_flux(z) -
        topographic_forcing(z, b)
    )

solver = ForwardEuler(z0, F)
hs, qs = run_simulation(solver, final_time, num_steps, output_freq)
0.789, 1.004: 100%|██████████| 2000/2000 [00:08<00:00, 240.98it/s]
make_animation(
    hs, b, timestep, output_freq, vmin=0.96, vmax=1.04
)

This is even worse than before; the resolution of the bump at the bed is poor enough that there's a permanent disturbance around it. The energy drift has a very different character this time around, instead oscillating around a higher value than the initial one.

energies = [
    firedrake.assemble(
        0.5 * (inner(q, q) / h + g * (h + b)**2) * dx
    )
    for h, q in zip(hs, qs)
]

fig, axes = plt.subplots()
axes.plot(energies);
DG(1) basis

Now let's try increasing the resolution by using a higher-order finite element basis and seeing what happens.

Q1 = firedrake.FunctionSpace(mesh, family='DG', degree=1)
V1 = firedrake.VectorFunctionSpace(mesh, family='DG', degree=1)
Z1 = Q1 * V1

We can reuse the expression objects for the starting thickness and the bed bump since they're purely symbolic.

z0 = firedrake.Function(Z1)
z0.sub(0).project(h_expr - b);

The integrator and the simulation loop are the same as they were before.

solver = ForwardEuler(z0, F)
hs, qs = run_simulation(solver, final_time, num_steps, output_freq)
0.791, 1.012: 100%|██████████| 2000/2000 [00:31<00:00, 62.86it/s]

With the DG(1) basis, the results are much less diffusive than with DG(0), but we're still getting weird artifacts near the bed bump.

make_animation(
    hs, b, timestep, output_freq,
    shading='gouraud', vmin=0.96, vmax=1.04
)

While the DG(1) basis gives substantially better results in the eyeball norm than the DG(0) basis, the energy drift has gotten much worse. If we were to run this simulation for an even longer time, the unphysical injection of energy into the system could eventually lead to such large momenta that the timestep we used is no longer stable.

energies = [
    firedrake.assemble(
        0.5 * (inner(q, q) / h + g * (h + b)**2) * dx
    )
    for h, q in zip(hs, qs)
]

fig, axes = plt.subplots()
axes.plot(energies);

Now let's try a higher-order timestepping scheme.

SSPRK(3) scheme

The strong stability-preserving Runge-Kutta method of order 3 is a popular timestepping scheme for hyperbolic equations. The 3rd-order convergence is obviously a big advantage in addition to simplicity of implementation. We can implement a timestepping class for it in a similar way to the Euler scheme, but with more internal state for the Runge-Kutta stages.

class SSPRK3:
    def __init__(self, state, equation):
        z = state.copy(deepcopy=True)
        dt = firedrake.Constant(1.0)
        
        num_stages = 3
        zs = [state.copy(deepcopy=True) for stage in range(num_stages)]
        Fs = [equation(z), equation(zs[0]), equation(zs[1])]
        
        Z = z.function_space()
        w = firedrake.TestFunction(Z)
        forms = [
            inner(zs[0] - z, w) * dx - dt * Fs[0],
            inner(zs[1] - (3 * z + zs[0]) / 4, w) * dx - dt / 4 * Fs[1],
            inner(zs[2] - (z + 2 * zs[1]) / 3, w) * dx - 2 * dt / 3 * Fs[2]
        ]
        
        problems = [Problem(form, zk) for form, zk in zip(forms, zs)]
        solvers = [Solver(problem, **parameters) for problem in problems]
        
        self.state = z
        self.stages = zs
        self.timestep = dt
        self.solvers = solvers
    
    def step(self, timestep):
        self.timestep.assign(timestep)
        for solver in self.solvers:
            solver.solve()
        self.state.assign(self.stages[-1])
z0 = firedrake.Function(Z1)
z0.sub(0).project(h_expr - b)

solver = SSPRK3(z0, F)

Since there are now three Runge-Kutta stages, the simulation takes about 3x as long to run.

hs, qs = run_simulation(solver, final_time, num_steps, output_freq)
0.784, 1.014: 100%|██████████| 2000/2000 [01:31<00:00, 21.84it/s]

With the timestep we've chosen, the results of using the SSPRK3 scheme are visually indistinguishable from those of the explicit Euler scheme. This improved method has basically the same energy drift as explicit Euler even though it is formally of higher order in time.

energies = [
    firedrake.assemble(
        0.5 * (inner(q, q) / h + g * (h + b)**2) * dx
    )
    for h, q in zip(hs, qs)
]

fig, axes = plt.subplots()
axes.plot(energies);

Since the SSPRK3 scheme is more accurate, we would ideally be able to use a timestep that's several times larger and get more accuracy at less cost if we wanted to. But this scheme has the same CFL requirements as explicit Euler, so if we wanted to do that we'd also have to reduce the resolution in space. There is a 4-stage, 3rd-order accurate SSPRK scheme that has twice as large a stability region which does allow us to be more flexible in what tradeoffs we make while using the same spatial discretization. Nonetheless, even that scheme won't help reduce the energy drift that we've incurred by going to the more sophisticated DG(1) discretization in space.

Hipster elements

Just going to a higher-order time discretization didn't substantially reduce the energy drift. Let's see what happens if we consider a different spatial discretization instead. The idea behind compatible finite element discretizations is to use function spaces for the thickness and momentum fields such that the divergence operator maps the momentum space exactly onto the thickness space. Compatible elements tend to recover many of the beneficial properties of staggered finite difference grids. If that subject is unfamiliar to you, I learned about it from Dale Durran's book. I learned about compatible finite elements from Colin Cotter, who's done a lot of the work developing this idea and demonstrating its usefulness on practical problems. I can't do justice to the subject here, but if you want to read more Natale et al. (2016) gives a great review.

The same elements that work well for the mixed form of the Poisson equation tend to also work well for the shallow water equations, which is lucky because people have been studying which elements work for mixed Poisson since the 70s. Here, we'll use the Brezzi-Douglas-Fortin-Marini element or BDFM(2) for short.

DG1 = firedrake.FunctionSpace(mesh, family='DG', degree=1)
BDFM2 = firedrake.FunctionSpace(mesh, family='BDFM', degree=2)
Z = DG1 * BDFM2
z0 = firedrake.Function(Z)
z0.sub(0).project(h_expr - b)

solver = ForwardEuler(z0, F)

The BDFM element family has more degrees of freedom than DG(1) and requires a more involved transformation to the reference element, so we should expect that this scheme will be substantially more expensive than the schemes we've used so far. On top of that, since the velocity space uses polynomials of higher degree than 1, we'll also incur a greater penalty from CFL by needing a smaller timestep. We've used a timestep so far of 1/200. We could get away with a timestep of 1/100 for the DG(1)/DG(1) discretization but not for BDFM(2)/DG(1).

hs, qs = run_simulation(solver, final_time, num_steps, output_freq)
0.793, 1.024: 100%|██████████| 2000/2000 [05:24<00:00,  6.17it/s]

While this discretization was much more expensive than DG(1), we make up for it by cutting the energy drift substantially.

energies = [
    firedrake.assemble(
        0.5 * (inner(q, q) / h + g * (h + b)**2) * dx
    )
    for h, q in zip(hs, qs)
]

fig, axes = plt.subplots()
axes.plot(energies);

If we really wanted close to exact energy conservation, we could explore symplectic schemes. The disadvantage of symplectic schemes is that they're typically implicit, for example the implicit midpoint rule.

Conclusion

Hyperbolic systems of conservation laws, like the shallow water equations, introduce a huge amount of new complexity compared to scalar problems. Here we've looked at some of the issues around the numerical flux and the choices of time and space discretization. Going to higher order in space gave us more accurate solutions but introduced an additional energy drift that could be a serious problem in long-running simulations. Using a more accurate time discretization didn't reduce the drift at all. We had to use a different and even more sophisticated spatial discretization in order to reduce this effect.

We've focused here on explicit timestepping schemes. These have the virtue of being particularly simple to implement. In other scenarios, however, you may end up stuck with a mesh that's finer than you'd like. The CFL condition can then become oppressive if the wave speeds across finer regions grow too high. In the posts that follow, we'll dig into this issue even more and start looking at implicit and Rosenbrock-type schemes.

Convection-diffusion

In a previous post, we looked at how to discretize conservation laws using the discontinuous Galerkin method. The DG method, with an appropriate choice of numerical flux and limiting scheme, can be a great way to solve purely hyperbolic problems. But many realistic problems aren't purely hyperbolic -- there's diffusive transport as well:

$$\frac{\partial\phi}{\partial t} + \nabla\cdot \phi u = \nabla\cdot k\nabla \phi.$$

where $\phi$ is the solution, $u$ is a velocity field, and $k$ is the conductivity. Depending on the ratio $UL / k$, where $L$ is a characteristic length scale, the problem is either advection- or diffusion-dominated.

What basis functions should we use for a problem with mixed character like convection-diffusion? It might seem that we have only bad choices. We could use continuous basis functions, which makes the diffusive part of the equation easy, but we might then need to stabilize the advective part using, say, streamlined upwinding. DG methods work perfectly well for discretizing the advection operator, but it feels a little strange to use basis functions that don't even live in the same function space as the solution. Nonetheless, it is possible to use DG for elliptic problems and we've already seen how in the form of Nitsche's method.

DG for elliptic problems

Recall that Nitsche's method gave us a way to weakly impose Dirichlet boundary conditions for elliptic PDE by modifying the variational principle rather than the discretized linear system. In the post where I introduced Nitsche's method, the application was for Dirichlet conditions at the boundary $\partial\Omega$ of the domain $\Omega$. To arrive at a DG discretization of elliptic problems, we'll instead imagine using Nitsche's method on every cell of the mesh. Rather than impose a set value on the boundary of each cell, we'll instead use Nitsche's method to penalize discontinuous solutions. For this reason the method that I'll describe is called the symmetric interior-penalty discontinuous Galerkin method.

Let's suppose we were only interested in solving the variational problem to minimize

$$J(\phi) = \frac{1}{2}\int_\Omega k|\nabla\phi|^2dx$$

subject to the boundary condition $\phi|_{\partial\Omega} = g$. This is another way of stating the weak form of the elliptic part of our problem above. Rather than enforce the Dirichlet boundary condition at the level of the discretized linear system, as is customary, we could instead use a different action functional:

$$J(\phi) = \frac{1}{2}\int_\Omega k|\nabla \phi|^2dx - \int_{\partial\Omega}k\frac{\partial \phi}{\partial n}(\phi - g)\,ds + \sum_E\int_E\frac{\gamma k}{2h}|\phi - g|^2ds,$$

where $E$ are all of the boundary faces of the mesh, $h$ the diameter of $E$, and $\gamma$ a constant. Let $\theta$ be the smallest angle between any two edges of the mesh, $d$ the spatial dimension, and $p$ the polynomial degree of the finite element basis. In the post on Nitsche's method, we showed that if

$$\gamma > 2 \cdot p \cdot (p + d - 1) \cdot \csc\theta\cdot\cot\theta / 2\cdot\frac{\max k}{\min k}$$

the modified action functional is convex and we can be assured that there is a discrete solution. The advantage of Nitsche's method is that we don't have to modify the discretized linear system, which can be error-prone, and that we can compute a good value of $\gamma$ just by examining the mesh and conductivity coefficient.

The idea behind DG discretization of elliptic problems is that, instead of using Nitsche's method to enforce a Dirichlet condition at the boundary of the whole domain, we use it to force the solution to be continuous across element boundaries. Rather than match $q$ to some externally defined function $g$, we instead match the value $q_+$ on one side of a facet $E$ to the value $q_-$ on the other side; terms like $q - g$ instead become $[q] = q_+ - q_-$, where we've introduced the shorthand $[\cdot]$ to denote the jump across an inter-element boundary. In that case, the action functional becomes

$$J(\phi) = \frac{1}{2}\sum_K\int_K k|\nabla\phi|^2dx + \sum_E\int_Ek\left[\frac{\partial\phi}{\partial n}\right][\phi]\, dS + \sum_E\int_E\frac{\gamma k}{2h}[\phi]^2dS + \ldots$$

where I've left off the remaining terms. The same considerations apply to picking $\gamma$ and we can actually get away with using the exact same value as before. To illustrate, let's try this on a toy problem.

Demonstration

We'll use the unit square as our domain. Although we know ahead of time that the minimum triangle area is $\pi / 4$, I've included some code here to calculate it. We'll need to know this value in order to get a good value of the penalty parameter.

import firedrake
nx, ny = 32, 32
mesh = firedrake.UnitSquareMesh(nx, ny)
import numpy as np
from numpy.linalg import norm

coords = mesh.coordinates.dat.data_ro
cells = mesh.coordinates.cell_node_map().values

θ = np.inf
for cell in cells:
    for k in range(3):
        x, y, z = coords[np.roll(cell, k)]
        ζ, ξ = y - x, z - x
        angle = np.arccos(np.inner(ζ, ξ) / (norm(ζ) * norm(ξ)))
        θ = min(angle, θ)

For boundary values, we'll use a random trigonometric polynomial.

from numpy import pi as π
from numpy.random import default_rng
x, y = firedrake.SpatialCoordinate(mesh)

rng = default_rng(seed=1)
def random_fourier_series(std_dev, num_modes, exponent):
    from firedrake import sin, cos
    A = std_dev * rng.standard_normal((num_modes, num_modes))
    B = std_dev * rng.standard_normal((num_modes, num_modes))
    expr = sum([(A[k, l] * sin(π * (k * x + l * y)) +
                 B[k, l] * cos(π * (k * x + l * y)))
                / (1 + (k**2 + l**2)**(exponent/2))
                for k in range(num_modes)
                for l in range(int(np.sqrt(num_modes**2 - k**2)))])
    return expr
g = random_fourier_series(std_dev=.25, num_modes=5, exponent=1)
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
V = firedrake.FunctionSpace(mesh, family='CG', degree=2)
fig = plt.figure()
axes = fig.add_subplot(projection='3d')
firedrake.trisurf(firedrake.interpolate(g, V), axes=axes);

For basis functions, we'll use polynomials of degree 1. As we saw in the previous post on hyperbolic conservation laws, these gave very good numerical solutions provided that we had a good timestepping scheme. Now that we know the polynomial degree and the mesh quality, we can calculate the penalty parameter for Nitsche's method. We've put in a fudge factor $\alpha$ to make sure the penalty parameter is strictly greater than the minimum possible value.

from firedrake import Constant
p = 1
α = 1/2
C = p * (p + 1)
γ = Constant(2 * C / α**2 / (np.sin(θ) * np.tan(θ/2)))

The action functional for the DG formulation of the problem consists of five parts: an internal cell flux; the across-cell flux; the across-cell penalty; the boundary flux; and the boundary penalty.

from firedrake import inner, grad, dx, ds, dS

Q = firedrake.FunctionSpace(mesh, family='DG', degree=p)
n = firedrake.FacetNormal(mesh)
h = firedrake.CellSize(mesh)

ϕ = firedrake.project(g, Q)
dϕ_dn = inner(grad(ϕ), n)

J_cells = 0.5 * inner(grad(ϕ), grad(ϕ)) * dx
J_facet_flux = -(dϕ_dn('+') - dϕ_dn('-')) * (ϕ('+') - ϕ('-')) * dS
J_facet_penalty = 0.5 * γ / (h('+') + h('-')) * (ϕ('+') - ϕ('-'))**2 * dS

J_flux = -dϕ_dn * (ϕ - g) * ds
J_penalty = 0.5 * γ * (ϕ - g)**2 / h * ds

J = (
    J_cells +
    J_facet_flux +
    J_facet_penalty +
    J_flux +
    J_penalty
)

Finally, we can solve the PDE just like we did before. Note that, since we're using discontinuous piecewise linear basis functions, there are three degrees of freedom for every triangle of the mesh. Had we used continuous piecewise linear basis functions, the solution would have one degree of freedom for every vertex, which is much fewer than for the DG basis. Consequently we can expect the solution process to take much longer when using DG.

F = firedrake.derivative(J, ϕ)
parameters_lu = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'lu',
        'pc_factor_mat_solver_type': 'mumps'
    }
}
firedrake.solve(F == 0, ϕ, **parameters_lu)
fig = plt.figure()
axes = fig.add_subplot(projection='3d')
triangles = firedrake.trisurf(ϕ, axes=axes)
fig.colorbar(triangles);

For comparison, we can compute the solution using continuous basis functions.

CG = firedrake.FunctionSpace(mesh, family='CG', degree=p)
ψ = firedrake.Function(CG)
J = 0.5 * inner(grad(ψ), grad(ψ)) * dx

F = firedrake.derivative(J, ψ)
bc = firedrake.DirichletBC(CG, g, 'on_boundary')
firedrake.solve(F == 0, ψ, bc, **parameters_lu)
firedrake.norm(ϕ - ψ) / firedrake.norm(ψ)
0.012794504299311067

The relative error is quite small, which suggests that our approach isn't totally off base. We're now ready to introduce timestepping, which gets quite interesting for problems of mixed character like ours.

IMEX methods

The natural way to timestep the heat equation is to use an implicit scheme like backward Euler. If you did try to use an explicit method, the CFL condition becomes oppressive for parabolic problems because perturbations propagate with infinite speed. For the advection equation, on the other hand, the natural approach is to use an explicit scheme because perturbations travel with finite speed. In our case we'll simplify things considerably by assuming the velocity of the medium doesn't change in time, but for realistic problems it too obeys some PDE and we have to assume that it can change. The argument for using explicit schemes becomes much stronger when the velocity can change because it becomes much harder to reuse information from past implicit solves.

We can reconcile the fact that different timestepping schemes work best for each part of the equation by using a splitting method. The idea of splitting methods is to separately integrate each part of the dynamics with a method that works best for it. To illustrate this it's helpful to imagine the problems as a linear system of ODE

$$\frac{\partial\phi}{\partial t} = (A + B)\phi$$

where $A$ and $B$ are operators. The formal solution to this problem can be expressed using the matrix exponential:

$$\phi(t) = e^{t(A + B)}\phi(0)$$

but we're then faced with the challenge of simultaneously integrating both parts of the dynamics. If $A$ and $B$ commute, then the exponential factors into a product, but we have to get very lucky for that to happen. Instead, we can recognize that

$$\exp\{t(A + B)\} = \exp(tA)\exp(tB)\exp\left(-\frac{t^2}{2}[A, B]\right)\cdot \ldots$$

In other words, to approximate integrating the problem forward by one timestep, we can first integrate the $A$ dynamics, then the $B$ dynamics. There is a cost in doing so: we incur a splitting error proportional to the commutator $[A, B]$ of the two operators. But this error is only of the order of $\delta t^2$ for a single timestep and thus on the order of $\delta t$ over all. We can derive higher-order splitting schemes that incur even less of a cost, often using the Zassenhaus formula. The equation I wrote above is the Zassenhous formula up to second order.

In our particular case, we know that using implicit Euler will work well for the parabolic part of the problem and explicit Euler (with a CFL-stable timestep) will work well for the hyperbolic part. A first-order splitting scheme amounts to using each scheme in succession. This approach to mixed-type problems is often called an implicit/explicit or IMEX discretization in the literature.

Demonstration

We'll use a very similar setup to the post on conservation laws to examine a simple first-order IMEX scheme for the convection-diffusion equation. The velocity field will be uniform solid body rotation about the center of the domain. In order to keep the solution from being immediately smoothed over, we'll use a conductivity coefficient of 1/1000. This puts the problem well in the advection-dominated regime. Nonetheless, the diffusivity is large enough that we won't need to use a flux limiter; any spurious oscillations will naturally get smoothed out.

p = 1
Q = firedrake.FunctionSpace(mesh, family='DG', degree=p)
from firedrake import sqrt, as_vector, min_value, max_value

x = firedrake.SpatialCoordinate(mesh)
x_c = Constant((5/8, 5/8))
R_c = Constant(1/8)

x_b = Constant((3/8, 3/8))
R_b = Constant(1/8)

ϕ_expr = (
    max_value(0, 1 - sqrt(inner(x - x_c, x - x_c) / R_c**2)) +
    max_value(0, 1 - inner(x - x_b, x - x_b) / R_b**2)
)
ϕ_0 = firedrake.project(ϕ_expr, Q)
ϕ = ϕ_0.copy(deepcopy=True)
x = firedrake.SpatialCoordinate(mesh)
y = Constant((.5, .5))
r = x - y
u = as_vector((-r[1], r[0]))
speed = firedrake.interpolate(sqrt(inner(u, u)), Q)
max_speed = speed.dat.data_ro.max()

Q0 = firedrake.FunctionSpace(mesh, family='DG', degree=0)
diameters = firedrake.project(firedrake.CellDiameter(mesh), Q0)
min_diameter = diameters.dat.data_ro.min()

cfl_timestep = min_diameter / max_speed

final_time = 2 * π
num_steps = 4 * int(final_time / cfl_timestep)
dt = firedrake.Constant(final_time / num_steps)

First, we'll create a solver for the diffusive part of the problem. We can take advantage of the fact that the implicit Euler timestep for a parabolic PDE is itself a minimization problem to keep our description as succinct as possible.

J_mass = 0.5 * (ϕ - ϕ_0)**2 * dx

k = Constant(1e-3)
dϕ_dn = inner(grad(ϕ), n)
J_cells = 0.5 * inner(grad(ϕ), grad(ϕ)) * dx
J_facet_flux = -(dϕ_dn('+') - dϕ_dn('-')) * (ϕ('+') - ϕ('-')) * dS
J_facet_penalty = 0.5 * γ / (h('+') + h('-')) * (ϕ('+') - ϕ('-'))**2 * dS

J = J_mass + dt * k * (J_cells + J_facet_flux + J_facet_penalty)

F = firedrake.derivative(J, ϕ)
heat_problem = firedrake.NonlinearVariationalProblem(F, ϕ)
heat_solver = firedrake.NonlinearVariationalSolver(heat_problem, **parameters_lu)

Next we'll create a solver for the convective part of the problem.

from firedrake import min_value, max_value
m = firedrake.TrialFunction(Q) * firedrake.TestFunction(Q) * dx

ψ = firedrake.TestFunction(Q)
cell_flux = -inner(grad(ψ), ϕ * u) * dx

f = ϕ * max_value(0, inner(u, n))
facet_flux = (f('+') - f('-')) * (ψ('+') - ψ('-')) * dS

ϕ_in = firedrake.Constant(0)
influx = ϕ_in * min_value(0, inner(u, n)) * ψ * ds
outflux = ϕ * max_value(0, inner(u, n)) * ψ * ds

dϕ_dt = -(cell_flux + facet_flux + influx + outflux)
δϕ = firedrake.Function(Q)
flow_problem = firedrake.LinearVariationalProblem(m, dt * dϕ_dt, δϕ)
parameters_bjac = {
    'solver_parameters': {
        'ksp_type': 'preonly',
        'pc_type': 'bjacobi',
        'sub_pc_type': 'ilu'
    }
}
flow_solver = firedrake.LinearVariationalSolver(flow_problem, **parameters_bjac)

Now we can try the alternating scheme.

import tqdm

output_freq = 4
ϕs = []

for step in tqdm.trange(num_steps, unit='timesteps'):
    flow_solver.solve()
    ϕ += δϕ
    
    ϕ_0.assign(ϕ)
    heat_solver.solve()
    
    if step % output_freq == 0:
        ϕs.append(ϕ.copy(deepcopy=True))
100%|██████████| 400/400 [00:12<00:00, 31.92timesteps/s]
%%capture
fig, axes = plt.subplots()
axes.set_aspect('equal')
axes.get_xaxis().set_visible(False)
axes.get_yaxis().set_visible(False)
colors = firedrake.tripcolor(ϕ, num_sample_points=1, vmin=0., vmax=1., axes=axes)
from firedrake.plot import FunctionPlotter
fn_plotter = FunctionPlotter(mesh, num_sample_points=1)
from matplotlib.animation import FuncAnimation
def animate(ϕ):
    colors.set_array(fn_plotter(ϕ))

interval = 1e3 * output_freq * float(dt)
animation = FuncAnimation(fig, animate, frames=ϕs, interval=interval)
from IPython.display import HTML
HTML(animation.to_html5_video())

Unlike in the purely hyperbolic case, the solution becomes much smoother by the time the medium has completed a full rotation. If we had used an even smaller conductivity coefficient $k$, the intrinsic diffusion in the problem would be too small to suppress the oscillations that we would get for a purely convective problem using degree-1 basis functions. In that case a flux limiter would be necessary to get a decent solution.

Conclusion

The natural choices to make about spatial and temporal discretization for parabolic and hyperbolic problems are complete opposites. To solve a PDE that combines aspects of both types, finding a good middle ground can be a challenge. Here we showed that non-conforming discretizations of elliptic PDE allow us to apply what we already know about DG discretizations of conservation laws to other problems.

For time discretization, operator splitting helps turn a harder problem into two easy ones. In this post we only showed a first-order splitting scheme, but Strang splitting achieves second-order accuracy with only a little more work. Operator splitting is a great trick generally but achieving more than first-order accuracy for more than three operators or more is much harder than for two.