Particle Filters, Particle MCMC, and All That
Tony Chen
15 Mar 2022
Sequential Monte Carlo (SMC) methods, (also known as Particle Filters), are a class of inference algorithms that I’ve learned, forgotten, and had to re-learn many many times over. As I was teaching myself how SMC works for the nth time last week, I became frustrated enough to write a blog post about it, ensuring that I will (hopefully) remember it for good, this time. Most of my derivations will follow this excellent review.
- State Space Models
- Sequential Importance Sampling
- Resampling
- Choice of Proposal
- Particle MCMC and Particle Gibbs
- Wrap Up
State Space Models
I’ll be using a state space model as the prototypical model in all of our discussions about particle filters and SMC. In many ways, state space models are the natural prototype to discuss when talking about SMC: their temporal, autoregressive structure makes them challenging to sample from with traditional MCMC methods, but the Markov assumption and the conditional independence of observations keep them tractable enough, given the right method. I should mention, though, that SMC methods are not limited to models with some kind of temporal dependence; their use extends far beyond state space models, and they can even be used as a drop-in replacement for MCMC on static models.
Just for a quick recap, a state space model consists of a sequence of latent variables \(x_1, \ldots, x_T\) and observations \(y_1, \ldots, y_T\), related as follows:
\[x_1 \sim p(x_1),\\ x_n \sim p(x_n|x_{n-1}, \theta),\\y_n \sim p(y_n|x_n, \theta).\]Above, we are going to treat \(\theta\) as fixed and given, although we will relax this assumption when we talk about Particle MCMC methods. Thus, our main task is to infer the sequence of distributions
\[\pi_n(x_{1:n}) = p(x_{1:n}|y_{1:n}) = \frac{1}{Z_n}\gamma_n(x_{1:n}),\]where \(Z_n = p(y_{1:n})\) is the marginal likelihood (or partition function, if you’re a physicist), and \(\gamma_n(x_{1:n}) = p(x_{1:n}, y_{1:n})\) is the joint distribution of the latents and the observations.
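To make things concrete, here is a minimal sketch of simulating from a linear-Gaussian state space model in Python; the model and parameter names (`a`, `q`, `r`) are my own illustrative choices, not anything from the review, and I’ll reuse them in the sketches below.

```python
import numpy as np

def simulate_ssm(T, a=0.9, q=1.0, r=0.5, rng=None):
    """Simulate a linear-Gaussian state space model (illustrative choice):
       x_1 ~ N(0, q),  x_t ~ N(a * x_{t-1}, q),  y_t ~ N(x_t, r)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(T)
    y = np.zeros(T)
    x[0] = rng.normal(0.0, np.sqrt(q))
    y[0] = rng.normal(x[0], np.sqrt(r))
    for t in range(1, T):
        x[t] = rng.normal(a * x[t - 1], np.sqrt(q))   # transition p(x_t | x_{t-1})
        y[t] = rng.normal(x[t], np.sqrt(r))           # emission   p(y_t | x_t)
    return x, y
```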
Sequential Importance Sampling
The simplest type of SMC is sequential importance sampling (SIS), where we sample a set of points, propagate them through time, and associate each point with a weight. Throughout, I’m assuming that you’re already familiar with regular, old-fashioned importance sampling.
What makes SIS, well, sequential is our choice of proposal distribution. We could choose a proposal \(q(x_{1:T})\) that proposes an entire trajectory all at once, but this would be a horrible idea from an efficiency standpoint, particularly as \(T\) gets large. Instead, we are going to choose a proposal distribution that has a Markovian decomposition, so that we can efficiently propose values one time step at a time.
More formally, we’ll choose the following form for our proposal distribution:
\[q(x_{1:T}) = q(x_1)\prod_{i=2}^Tq(x_i|x_{i-1}).\]Thus, to sample a trajectory \(X_{1:T}\), we first sample \(X_1 \sim q(x_1)\), then sample \(X_2 \sim q(x_2 \vert x_1)\), and so on.
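In code, sampling from such a proposal is just a loop. A minimal sketch, where `q1` and `q_step` are hypothetical callables standing in for \(q(x_1)\) and \(q(x_t \vert x_{t-1})\):

```python
import numpy as np

def sample_trajectory(T, q1, q_step, rng=None):
    """Sample X_{1:T} from a Markov proposal q(x_1) * prod_t q(x_t | x_{t-1}).
    q1 and q_step are hypothetical callables: q1(rng) -> x_1, q_step(x, rng) -> x'."""
    rng = np.random.default_rng() if rng is None else rng
    x = [q1(rng)]                        # X_1 ~ q(x_1)
    for _ in range(1, T):
        x.append(q_step(x[-1], rng))     # X_t ~ q(x_t | x_{t-1})
    return np.array(x)
```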
We associate to each trajectory \(X_{1:T}^i\) a weight \(w_T^i = w_T(X_{1:T}^i)\), where
\[w_T(x_{1:T}) = \frac{\gamma_T(x_{1:T})}{q(x_{1:T})}.\]The structure of our proposal distribution allows us to calculate these weights on the fly, via the following relation:
\[\begin{aligned} w_t(x_{1:t}) &= \frac{\gamma_t(x_{1:t})}{q(x_{1:t})} \\ &=p(x_t, y_t \vert x_{1:t-1}, y_{1:t-1})\frac{\gamma_{t-1}(x_{1:t-1})}{q(x_{1:t-1})q(x_t\vert x_{t-1})} \\ &=w_{t-1}(x_{1:t-1})\frac{\gamma_t(x_{1:t})}{\gamma_{t-1}(x_{1:t-1})q(x_t|x_{t-1})} \\ &=w_{t-1}(x_{1:t-1})\alpha_t(x_{1:t}). \end{aligned}\]For the state space model above, the incremental weight simplifies to \(\alpha_t(x_{1:t}) = \dfrac{p(x_t \vert x_{t-1})\,p(y_t \vert x_t)}{q(x_t \vert x_{t-1})}\). Therefore, the SIS algorithm turns out to be pretty simple: initialize our particles \(X_1^i\) and weights \(w_1^i = w_1(X_1^i) = \dfrac{p(X_1^i)\,p(y_1\vert X_1^i)}{q(X_1^i)}\), and at each step, sample \(X_t^i \sim q(x_t\vert x_{t-1})\) and compute \(w_t^i = w_{t-1}(X_{1:t-1}^i)\,\alpha_t(X_{1:t}^i).\)
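Here is what SIS looks like for the linear-Gaussian sketch from earlier. To keep the code short I use the model’s transition density as the proposal (the “bootstrap” choice we’ll discuss below), in which case \(\alpha_t = p(y_t \vert x_t)\); I also work with log-weights, which avoids numerical underflow for even moderate \(T\):

```python
import numpy as np
from scipy.stats import norm

def sis(y, N, a=0.9, q=1.0, r=0.5, rng=None):
    """Sequential importance sampling for the linear-Gaussian sketch above.
    The proposal is the transition density itself, so the incremental
    weight is alpha_t = p(y_t | x_t)."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y)
    x = np.zeros((N, T))                          # particle trajectories
    x[:, 0] = rng.normal(0.0, np.sqrt(q), size=N) # X_1^i ~ q(x_1) = p(x_1)
    logw = norm.logpdf(y[0], x[:, 0], np.sqrt(r)) # w_1 = p(y_1 | x_1)
    for t in range(1, T):
        x[:, t] = rng.normal(a * x[:, t - 1], np.sqrt(q))  # X_t^i ~ q(x_t | x_{t-1})
        logw += norm.logpdf(y[t], x[:, t], np.sqrt(r))     # w_t = w_{t-1} * alpha_t
    W = np.exp(logw - logw.max())
    return x, W / W.sum(), logw   # trajectories, normalized weights, log-weights
```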
Once we have our particles \(X_{1:T}^i\), we can approximate our distribution \(\pi_n\) via the following empirical distribution:
\[\pi_T \approx \sum_{i=1}^N W_T^i \delta_{X_{1:T}^i},\]where \(\delta_{X_{1:T}^i}\) is the Dirac delta at \(X_{1:T}^i\), and \(W_T^i\) is the normalized version of \(w_T^i\). Note that you might often see it written this way:
\[\pi_T(dx_{1:T}) \approx \sum_{i=1}^N W_T^i \delta_{X_{1:T}^i}(dx_{1:T}),\]which looks maybe a little blasphemous, but is really just meant to make clear the fact that
\[\mathbb{E}_{\pi_T}[f(x_{1:T})] \approx \int \sum_{i=1}^N f(x_{1:T}) W_T^i \delta_{X_{1:T}^i}(dx_{1:T}) = \sum_{i=1}^N W_T^i f(X_{1:T}^i).\](here’s a simple analysis exercise: prove that for the Dirac measure \(\delta_p(A) = \mathbb{I}(p \in A)\), \(\int f(x) \delta_p(dx) = f(p).\))
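In code, estimating a posterior expectation is just a weighted average. Continuing with the sketches above, here is the smoothed posterior mean of each \(x_t\):

```python
# Continuing from the earlier sketches: smoothed posterior means E[x_t | y_{1:T}].
x_true, y = simulate_ssm(T=50)
x, W, logw = sis(y, N=1000)
posterior_mean = (W[:, None] * x).sum(axis=0)   # one weighted average per time step
```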
The full empirical distribution \(\hat{\pi}_T\) can be used to perform smoothing, while the intermediate distributions \(\hat{\pi}_t\) can be used to perform filtering, approximating \(p(x_t \vert y_{1:t})\).
An added benefit of SIS is that we obtain an estimate of the marginal likelihood:
\[Z_t = \int \gamma_t(x_{1:t})\, dx_{1:t} \approx \frac{1}{N}\sum_{i=1}^N w_t^i = \hat{Z}_t.\]This will come in handy later.
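In practice this estimate should be computed in log-space from the log-weights, via the log-sum-exp trick; a small sketch:

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_likelihood(logw):
    """log Z_hat = log((1/N) * sum_i w_i), computed stably from log-weights."""
    return logsumexp(logw) - np.log(len(logw))
```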
Resampling
So far the SIS algorithm is looking pretty good, but it has a couple of fatal flaws. The primary drawback is that the variance of the importance weights (and hence of our estimates) grows exponentially with \(T\) - trying to forecast a model even a handful of steps into the future will give you an estimate that is unbiased, but of such high variance that it is effectively useless.
The solution is to resample the particles according to the weights \(W_t^i\), which corresponds to sampling \(N\) times from the empirical distribution \(\hat{\pi}_t\) (this is why we call it resampling - because we are sampling from an approximation that was itself obtained by sampling). After resampling, we are left with an equally weighted random sample from \(\hat{\pi}_t\), and so our approximation becomes
\[\pi_T(dx_{1:T}) \approx \frac{1}{N}\sum_{i=1}^N \delta_{X_{1:T}^i}(dx_{1:T}).\]Note that resampling has problems of its own - in particular path degeneracy: because particles with large weights get duplicated at every resampling step, the trajectories share fewer and fewer distinct ancestors as you look back in time, and the set of trajectories collapses onto just a few unique paths.
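For reference, here is a sketch of two standard resampling schemes: multinomial resampling draws the \(N\) ancestor indices i.i.d. from the weights, while systematic resampling spreads a single uniform draw evenly across the weight CDF, which typically has lower variance. After resampling, all weights are reset to \(1/N\).

```python
import numpy as np

def multinomial_resample(W, rng):
    """Draw N ancestor indices i.i.d. from the normalized weights W."""
    return rng.choice(len(W), size=len(W), p=W)

def systematic_resample(W, rng):
    """Systematic resampling: a single uniform draw generates N evenly
    spaced points in [0, 1), which are inverted through the weight CDF."""
    N = len(W)
    u = (rng.uniform() + np.arange(N)) / N
    idx = np.searchsorted(np.cumsum(W), u)
    return np.minimum(idx, N - 1)       # guard against floating-point round-off
```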
Choice of Proposal
How should we choose the proposal distributions \(q(x_t \vert x_{t-1})\)? The natural choice is to simply re-use the model distributions, and set
\[q(x_t\vert x_{t-1}) = p(x_t\vert x_{t-1}),\\ q(x_1) = p(x_1).\]This choice (which gives the classic bootstrap particle filter) is probably the most common, and with it the incremental weight reduces to \(\alpha_t = p(y_t \vert x_t)\). But it turns out that if your goal is to minimize the variance of the importance weights, the optimal proposal distribution is actually:
\[q(x_t|x_{t-1}) = p(x_t|y_t, x_{t-1}) \propto p(x_t|x_{t-1})p(y_t|x_t).\]Often this distribution is intractable because the normalizing constant cannot be evaluated, but when it can be, this proposal should be preferred.
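For the linear-Gaussian sketch, the optimal proposal is actually available in closed form, since a Gaussian prior times a Gaussian likelihood is again Gaussian:

```python
import numpy as np

def optimal_proposal_sample(x_prev, y_t, a=0.9, q=1.0, r=0.5, rng=None):
    """Sample x_t ~ p(x_t | x_{t-1}, y_t) for the linear-Gaussian sketch:
    the product N(x_t; a*x_prev, q) * N(y_t; x_t, r) is Gaussian in x_t."""
    rng = np.random.default_rng() if rng is None else rng
    s2 = 1.0 / (1.0 / q + 1.0 / r)       # posterior variance (precisions add)
    m = s2 * (a * x_prev / q + y_t / r)  # precision-weighted posterior mean
    return rng.normal(m, np.sqrt(s2))
```

With this proposal, the incremental weight becomes \(\alpha_t = p(y_t \vert x_{t-1}) = \mathcal{N}(y_t;\, a x_{t-1},\, q + r)\), which does not depend on the sampled \(x_t\) at all - that is exactly where the variance reduction comes from.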
There is a whole literature on designing good proposal distributions - extended and unscented Kalman filters, et cetera - and we won’t really talk about them here.
Particle MCMC and Particle Gibbs
Let’s now consider the general setting, where in addition to the latent states \(x_t\), we also have some parameter \(\theta \sim p(\theta)\) whose posterior we wish to estimate. Again, a naive approach would be to sample \((\theta, x_{1:T})\) jointly using a standard transition kernel, but this fairly quickly runs into mixing problems as \(T\) increases.
Particle MCMC (PMCMC) is a class of algorithms designed to tackle this problem, by combining SMC and MCMC samplers to construct general transition kernels for a broad class of statistical models with temporal (or other kinds of) dependence.
The intuition is as follows. We would like to sample \(x_{1:T}\) with an SMC sampler, but also marginalize out \(x_{1:T}\) in order to sample from \(p(\theta \vert y_{1:T})\). The latter distribution is obviously intractable, since it involves calculating a \(T\)-dimensional integral, but recall that as a byproduct of our SMC algorithm, we obtained the following approximation to the marginal likelihood:
\[Z_T = p(y_{1:T}|\theta) \approx \frac{1}{N}\sum_{i=1}^N w_T^i = \hat{Z}.\]With this marginal likelihood estimate and a transition kernel \(q(\theta' \vert \theta)\), we can construct a Metropolis-Hastings accept/reject step for \(\theta\), where we accept the new sample \(\theta'\) with probability
\[\frac{p(y_{1:T}|\theta')p(\theta')q(\theta|\theta')}{p(y_{1:T}|\theta)p(\theta)q(\theta'|\theta)} \approx \frac{\hat{Z}'p(\theta')q(\theta|\theta')}{\hat{Z}p(\theta)q(\theta'|\theta)}.\]It may not be obvious that this will actually work, but it turns out that for fixed \(N\), this algorithm will actually converge to the true posterior - as long as we make a couple of adjustments.
So, the general Particle MCMC algorithm (known as particle marginal Metropolis-Hastings, or PMMH) is as follows: at each step, first sample a new parameter \(\theta^* \sim q(\theta^* \vert \theta)\), and then run SMC conditioned on \(\theta^*\) to obtain a set of particles \(X_{1:T}^i\). We sample a trajectory \(X_{1:T}^* \sim \hat{\pi}_T\) and compute the corresponding marginal likelihood estimate \(\hat{Z}^*\), and then accept the tuple \((\theta^*, X_{1:T}^*)\) with a Metropolis-Hastings accept/reject step, with acceptance probability
\[\frac{\hat{Z}^*}{\hat{Z}}\frac{p(\theta^*)q(\theta|\theta^*)}{p(\theta)q(\theta^*|\theta)},\]where \(\hat{Z}\) is the marginal likelihood estimate associated with the previously accepted parameter and trajectory. Note that because we only keep a single trajectory out of the whole particle set, PMCMC does not suffer as much from the path degeneracy problem.
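Here is a sketch of the resulting PMMH loop. `run_smc` is a hypothetical helper that runs SMC for a given \(\theta\) and returns a sampled trajectory plus the log marginal likelihood estimate, `log_prior` evaluates \(\log p(\theta)\), and I use a symmetric random-walk proposal so the \(q\) terms cancel:

```python
import numpy as np

def pmmh(y, n_iters, N, run_smc, log_prior, step=0.1, rng=None):
    """Particle marginal Metropolis-Hastings with a symmetric random-walk
    proposal on theta, so the q(theta | theta') terms cancel in the ratio."""
    rng = np.random.default_rng() if rng is None else rng
    theta = 0.0                                   # arbitrary initial value
    traj, logZ = run_smc(y, theta, N)
    samples = []
    for _ in range(n_iters):
        theta_star = theta + step * rng.normal()  # propose theta* ~ q(.|theta)
        traj_star, logZ_star = run_smc(y, theta_star, N)
        log_ratio = logZ_star + log_prior(theta_star) - logZ - log_prior(theta)
        if np.log(rng.uniform()) < log_ratio:     # MH accept/reject
            theta, traj, logZ = theta_star, traj_star, logZ_star
        samples.append((theta, traj))
    return samples
```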
Alternatively, instead of sampling \(\theta\) and the trajectory jointly, we can sample them in separate blocks, leading us to the Particle Gibbs algorithm. Particle Gibbs is conceptually very close to PMMH - alternate between sampling \(\theta^*\) conditioned on the current trajectory \(X_{1:T}\), and a new trajectory \(X_{1:T}^*\) conditioned on \(\theta^*\) using an SMC sweep - but in order for the theoretical guarantees (the invariance of the target distribution under the kernel) to hold, we need to modify our SMC algorithm just a bit. The only modification is to introduce a reference trajectory \(X_{1:T}^N\), which is guaranteed to survive all resampling steps; this modified sweep is called conditional SMC. Don’t ask me why adding a reference trajectory leaves the target distribution invariant - I don’t know why and you’d be better off reading the proof here.
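For the curious, here is a rough sketch of what that conditional SMC sweep looks like for the bootstrap filter on the linear-Gaussian model, reusing `systematic_resample` from the resampling sketch above; this is illustrative rather than the exact construction in the paper.

```python
import numpy as np
from scipy.stats import norm

def conditional_smc(y, x_ref, N, a=0.9, q=1.0, r=0.5, rng=None):
    """Conditional SMC sweep (bootstrap flavor) for the linear-Gaussian sketch:
    the last particle is pinned to the reference trajectory x_ref and survives
    every resampling step; returns a freshly sampled trajectory."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y)
    x = np.zeros((N, T))
    x[:-1, 0] = rng.normal(0.0, np.sqrt(q), size=N - 1)
    x[-1, 0] = x_ref[0]                               # pin the reference particle
    logw = norm.logpdf(y[0], x[:, 0], np.sqrt(r))     # bootstrap weights: p(y_t | x_t)
    for t in range(1, T):
        W = np.exp(logw - logw.max()); W /= W.sum()
        anc = systematic_resample(W, rng)             # resample ancestors...
        anc[-1] = N - 1                               # ...but the reference survives
        x = x[anc]
        x[:-1, t] = rng.normal(a * x[:-1, t - 1], np.sqrt(q))
        x[-1, t] = x_ref[t]                           # reference stays on its own path
        logw = norm.logpdf(y[t], x[:, t], np.sqrt(r))
    W = np.exp(logw - logw.max()); W /= W.sum()
    return x[rng.choice(N, p=W)]                      # new trajectory for the Gibbs step
```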
Wrap Up
Sequential Monte Carlo methods are incredibly useful, but surprisingly underrated and underused outside of their traditional domains of signal processing, time series, and robotics. In particular, I think state space models offer huge value as computational tools for psychology and neuroscience, both as a way to analyze data and as models of cognition in their own right. And with the advent of PMCMC methods, the previously intractable problem of simultaneously estimating \(\theta\) and \(x_{1:T}\) has been alleviated significantly.