Particle filter, smoother and parameter learning

$\require{cancel}$
 

Let $X_{t}$ be the hidden states and $y_{t}$ be the observations of a discrete-time state-space model (Kalman filter and hidden Markov model) identified by a transition probability $P(X_{t+1}|X_{t})$ and an observation model $P(Y_{t}|X_{t})$, where $t=1,\cdots,T$. Particle filtering for making inferences about hidden states $X_{t}$ from observations $y_{1:t}$ is comprised of a predicting/sampling step to sample particle positionss $x_{t+1}^{1}\sim P(X_{t+1}|x_{t}^{i_{t}^{1}}),\cdots,x_{t+1}^{N}\sim P(X_{t+1}|x_{t}^{i_{N}^{1}})$ at time $t+1$ from the particles $x_{t}^{i_{t}^{1}},\cdots,x_{t}^{i_{t}^{N}}$ at time $t$, and an updating/resampling step to sample particle indices according to their likelihoods with respect to the observations $i_{t+1}^{1},\cdots,i_{t+1}^{N}\sim\mbox{Categorical}(P(y_{t+1}|x_{t+1}^{1}),\cdots,P(y_{t+1}|x_{t+1}^{N}))$, from the starting particle position $x_{0}^{1},\cdots,x_{0}^{N}=x_{0}$ and indices $i_{0}^{1}=1,\cdots,i_{0}^{N}=N$. The particle approximation of the filtering probability is $\hat{p}(x_{t}|y_{1,\cdots,t})=\frac{1}{N}\sum_{k}\delta(x_{t}^{i_{t}^{k}})\stackrel{N\to\infty}{\longrightarrow}p(x_{t}|y_{1,\cdots,t})$, and the particle approximation of the likelihood function is $\hat{p}(y_{1,\cdots,T})=\prod_{t}\hat{p}(y_{t}|y_{0,\cdots,t-1})=\prod_{t}\frac{1}{N}\sum_{k}\hat{p}(y_{t}|\xi_{t}^{k})\stackrel{N\to\infty}{\longrightarrow}p(y_{1,\cdots,T})$. The conditional probability of particle locations and indices from particle filtering $\psi^{N}$; the approximate likelihood of the of the particle locations $Z_n^N$ are given in the following. \begin{eqnarray} Z_{n}^{N} = \hat P(y_0,\cdots,y_n) & = & \left[{1\over N}\sum_{l=1}^{N}p(y_{0}|\xi_{0}^{l})\right]\cdot\prod_{k=1}^{n}\left[{1\over N}\sum_{l=1}^{N}p(y_{k}|\xi_{k}^{l})\right]\to Z_{n} = P(y_0)\cdot\prod_{k=1}^{n}p(y_{k}|y_{1,\cdots,k-1})\label{Z_approx}\\ \psi^{N}=P(\vec{\xi}\_{0},\cdots,\vec{\xi}\_{n},\vec{i}\_{0},\cdots,\vec{i}\_{n}|y_0,\cdots,y_n) & = & \left[\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\prod_{l=1}^{N}\frac{p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\sum_{l=1}^{N}p(y_{0}|\xi_{0}^{l})}\right]\cdot\prod_{k=1}^{n}\left[\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\prod_{l=1}^{N}\frac{p(y_{k}|\xi_{k}^{i_{k}^{l}})}{\sum_{l=1}^{N}p(y_{k}|\xi_{k}^{l})}\right]\label{psi}\\ \end{eqnarray} There are two common particle smoothing methods for making inferences about hidden states $X_{t}$ from observations $y_{1:T}$. The genealogical tree algorithm and the forward filtering backward sampling algorithm. Genealogical tree algorithm consists of iteratively tracing the ancestral line of particle positions for $x_{T}^{i_{T}^{1}},\cdots,x_{T}^{i_{T}^{N}}$: \[ x_{1,T}=x_{1}^{i_{1}^{i_{2}^{\cdot^{\cdot^{\cdot i_{T}^{k}}}}}},\cdots,x_{T,T}=x_{T-1}^{i_{T-1}^{i_{T}^{k}}},x_{T}^{i_{T}^{k}}\mbox{ where }k=1,\cdots,N. \] The ancestral lines including the particle positions at time $T$ form a particle approximation of the posterior distribution of state trajectory \[ \hat{p}(X_{1,\cdots,T}|y_{1,\cdots,T})=\frac{1}{N}\sum_{k}\delta_{x_{1,T}^{k},\cdots,x_{T,T}^{k}}(X_{1,\cdots,T})\stackrel{N\to\infty}{\longrightarrow}p(X_{1,\cdots,T}|y_{1,\cdots,T}). \] Hence the particles $x_{t,T}^{1},\cdots,x_{t,T}^{N}$ form an approximation of the smoothing probability $\hat{p}(X_{t}|y_{1,\cdots,T})=\frac{1}{N}\sum_{k}\delta_{x_{t,T}^{k}}(X_{t})\stackrel{N\to\infty}{\longrightarrow}p(X_{t}|y_{1,\cdots,T})$. A well known problem of the genealogical tree algorithm is that all particles at time $T$ will quickly converge to the same ancestors in their ancestral lines and subsequently the genealogical tree provides a poor approximation to the smoothing distribution $P(X_{t}|y_{1,\cdots,T})$ due to insufficient sample size. Forward filtering backward sampling Backward consists of resampling particle indices backward in time $j_{n}^{k}\sim\mbox{Categorical}(P(y_{n}|x_{n}^{1}),\cdots,P(y_{n}|x_{n}^{N}))$, and $j_{t}^{k}\sim\mbox{Categorical}(P(x_{t+1}^{j_{t+1}}|x_{t}^{1})P(y_{t}|x_{t}^{1}),\cdots,P(x_{t+1}^{j_{t+1}}|x_{t}^{N})P(y_{t}|x_{t}^{N}))$ for $t=n-1,\cdots,0$. The resulting particle trajectories $x_{1}^{i_{1}^{j_{1}^{k}}},\cdots,x_{T}^{i_{T}^{j_{T}^{k}}}$ form an approximation of the posterior distribution of the state trajectory and particles $x_{t,T}^{1},\cdots,x_{t,T}^{N}$ form an approximation of the smoothing probability. The conditional probability of particle locations and indices and a particle trajectory from backward resampling is consequently \begin{equation} k^{N}=P(\vec{\xi}\_{0}:\vec{\xi}\_{n},\vec{i}\_{0}:\vec{i}\_{n},j_{0}:j_{n}|y_{1:n})=\psi^{N}(\vec{\xi}\_{0},\cdots,\vec{\xi}\_{n},\vec{i}\_{0},\cdots,\vec{i}\_{n})\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\cdot\prod_{k=0}^{n-1}\frac{P(\xi_{t+1}^{j_{n+1}}|\xi_{t}^{j_{n}})P(y_{t}|\xi_{t}^{j_{n}})}{\sum_{l=1}^{N}P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{l})P(y_{t}|\xi_{t}^{l})}.\label{k} \end{equation} The genealogical tree algorithm and the forward filtering backward sampling algorithm both generate biased approimation of the posterior distribution of state trajectory. To remove the bias, we can resort to the Metropolis-Hastings algorithm or the importance sampling algorithm. The [Metropolis–Hastings algorithm][1] is a Markov chain Monte Carlo algorithm to draw a sequence of random samples from a probability distribution $P(x)$ usually on a very high dimensional space, provided that $P(x)$ is known up to a normalization factor. The lax requirement that $P(x)$ is known up to a normalization factor makes the Metropolis–Hastings algorithm particularly useful, because the normalization factor is often extremely difficult to calculate. The M-H algorithm proceeds by randomly moving around the sample space, often staying in high-density regions of $P(x)$ and occasionally exploring low-density regions according to the acceptance ratio $\alpha$ that compares the target distribution $P(x)$ and a proposal distribution. The <a>importance sampling</a> algorithm is a variance reduction technique to integrate a function against a probability distribution $P$, where the non-zero region of the function has a very small probability. To address the problem that a sample of distribution $P$ has insufficient number of points in the important region, we sample from another distribution $q$ that overweights this important region, and weight the sample points according to how the target distribution $P$ compares with the proposal distribution $q$. In either Metropolis-Hastings algorithm and importance sampling algorithm, we need to compare the proposal distribution and the target distribution. We proceed to show that the probability for us to get a particle trajectory from either the genealogical tree algorithm or the FFBS algorithm times the empirical probability $Z^N_n$ (Eq. $\ref{Z_approx}$ equals the joint probability of the particle trajectory and the observations, thus is proportional to the posterior probability of the particle trajectory according to the original dynamics. Hence we can combine the particle trajectories from multiple particle smoothing runs with weights $Z^N_n$. To find the probability of an ancestral line $(\cdots,\xi_{n-1}^{i_{n-1}^{i_n^1}},\xi_n^{i_n^1})$ from the genealogical tree algorithm, we can integrate/sum over all other particle positions and indices $i_n^l$ for $l=\{2,\cdots,n\}$, $\xi_n^l$ for $l=\{1,\cdots,n\}\backslash\{i_n^l\}$, $i_{n-1}^l$ for $l=\{1,\cdots,n\}\backslash\{ i_n^1 \}$, $\xi_{n-1}^l$ for $l=\{1,\cdots,n\}\backslash\{ i_{n-1}^{i_n^1} \}$, …. \begin{eqnarray*} & & \sum_{\mbox{except }i_{n}^{1},i_{n-1}^{i_{n}^{1}},\cdots;\xi_{n}^{i_{n}^{1}},\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}},\cdots}\psi^{N}Z_{n}^{N}\cdot N^{n+1}\\ & = & \sum_{\mbox{except }i_{n}^{1},i_{n-1}^{i_{n}^{1}},\cdots;\xi_{n}^{i_{n}^{1}},\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}},\cdots}\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\frac{\prod_{l=1}^{N}p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{0}|\xi_{0}^{l})\right)^{N-1}}\prod_{k=1}^{n}\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\\ & = & \cdots\int d\xi_{n}^{1},\cdots,\xi_{n}^{i_{n}^{1}-1},\xi_{n}^{i_{n}^{1}+1},\cdots,\xi_{n}^{N}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})\sum_{i_{n}^{2},\cdots,i_{n}^{N}}\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\\ & = & \cdots\int d\xi_{n}^{1},\cdots,\xi_{n}^{i_{n}^{1}-1},\xi_{n}^{i_{n}^{1}+1},\cdots,\xi_{n}^{N}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})p(y_{n}|\xi_{n}^{i_{n}^{1}})\cancel{\prod_{l=2}^{N}\frac{\sum_{i_{n}^{l}=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})}}\\ & = & \cdots p(y_{n}|\xi_{n}^{i_{n}^{1}})p(\xi_{n}^{i_{n}^{1}}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})\cancel{\prod_{l\ne i_{n}^{1}}\int d\xi_{n}^{l}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})}\\ & = & p(y_{n}|\xi_{n}^{i_{n}^{1}})p(\xi_{n}^{i_{n}^{1}}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})p(\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}}|\xi_{n-2}^{i_{n-1}^{i_{n-1}^{i_{n}^{1}}}})\cdots \end{eqnarray*} To find the probability of a backward sample $\xi_0^{j_0},\cdots,\xi_n^{j_n}$, we marginalize over all other particle positions and indices $\xi_n^l$ for $l\ne j_n$, $i_n^l$ for $l=1,\cdots,N$, $x_{n-1}^l$ for $l\ne j_{n-1}$, $i_{n-1}^l$ for $l=1,\cdots,n$ …. \begin{align*} & \sum_{\text{except }j_{1:n},\xi_{1:n}^{j_{1:n}}}k^{N}Z_{n}^{N}\\ = & \sum\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\frac{\prod_{l=1}^{N}p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{0}|\xi_{0}^{l})\right)^{N-1}}\prod_{k=1}^{n}\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{k}|\xi_{k}^{i_{k}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{k}|\xi_{k}^{l})\right)^{N-1}}\cdot\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\prod_{k=0}^{n-1}\frac{P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{j_{t}})P(y_{t}|\xi_{t}^{j_{t}})}{\sum_{l=1}^{N}P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{l})P(y_{t}|\xi_{t}^{l})}\\ = & \sum\cdots\sum_{i_{n}^{1:N},\xi_{n}^{\backslash j_{n}}}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\\ = & \sum\cdots p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})\prod_{l\ne j_{n}}\bcancel{\sum_{\xi_{n}^{\backslash j_{n}}}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})}\frac{\cancel{\prod_{l=1}^{N}\sum_{i_{n}^{l}=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}}{\cancel{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}}\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\cancel{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}}\\ = & \sum\cdots\sum_{i_{n-1}^{1:N},\xi_{n-1}^{\backslash j_{n-1}}}p(\xi_{n-1}^{l}|\xi_{n-2}^{i_{n-2}^{l}})\frac{\prod_{l=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n-1}|\xi_{n-1}^{l})\right)^{N-1}}\frac{P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})}{\sum_{l=1}^{N}P(\xi_{n}^{j_{n}}|\xi_{n-1}^{l})P(y_{n-1}|\xi_{n-1}^{l})}\cdot p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})P(y_{n}|\xi_{n}^{j_{n}})\\ = & \sum\cdots p(\xi_{n-1}^{j_{n-1}}|\xi_{n-2}^{i_{n-2}^{j_{n-1}}})\bcancel{\sum_{l\ne j_{n-1}}p(\xi_{n-1}^{l}|\xi_{n-2}^{i_{n-2}^{l}})}\cancel{\sum_{i_{n-1}^{j_{n}}=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{j_{n}}})p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})}\frac{\cancel{\prod_{l\ne j_{n}}\sum_{i_{n-1}^{l}=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{l}})}}{\cancel{\left(\sum\limits \_{l=1}^{N}p(y\_{n-1}|\xi_{n-1}^{l})\right)^{N-1}}}\frac{P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})}{\cancel{\sum_{l=1}^{N}P(\xi_{n}^{j_{n}}|\xi_{n-1}^{l})P(y_{n-1}|\xi_{n-1}^{l})}}\cdot P(y_{n}|\xi_{n}^{j_{n}})\\ = & \sum\cdots\cdot p(\xi_{n-1}^{j_{n-1}}|\xi_{n-2}^{i_{n-2}^{j_{n-1}}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n}|\xi_{n}^{j_{n}})\\ = & \rho(\xi_{0}^{j_{0}})P(y_{0}|\xi_{0}^{j_{0}})\prod_{t=1}^{n}P(\xi_{t}^{j_{t}}|\xi_{t-1}^{j_{t-1}})P(y_{t}|\xi_{t}^{j_{t}}). \end{align*} Hence the $l$-th particle trajectory from the $k$-th particle smoothing run ${^k}\xi{\_{1:n}^{j\_{1:n}^l}}$ has an importance weight ${^k}Z{^N_n}$. In other words, ${1\over K}\sum_{k=1}^K {1\over L_k}\sum_{l=1}^{L_k} \delta_{{^k}\xi{\_{1:n}^{j\_{1:n}^l}}}(X_{1:n})\cdot {^k}Z{^N_n}$ is an estimate of the posterior probability $P(X_{1:n}|y_{1:n})$. The Metropolis-Hastings algorithm to sample particle trajectories from multiple particle smoothing runs work in the following way. Basis: Set the current particle trajectories from particle smoothing ${^*}\xi_{1:n}^{j_{1:n}^l}={^1}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_1$ and keep ${^*}Z^N_n={^1}Z^N_n$. Induction: Run particle smoothing to get ${^k}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_k$ and keep ${^k}Z^N_n$. Accept the new trajectories with probability $\alpha=\min(1,{{^k}Z^N_n\over {^*}Z^N_n})$ and update the current trajectories ${^*}\xi_{1:n}^{j_{1:n}^l}={^k}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_k$ and ${^*}Z^N_n={^k}Z^N_n$.

Particle filter, smoother and parameter learning

Leave a Reply