# Hidden Markov Model

In this lecture,  we will review the exact inference of hidden Markov models and the Kalman filter. They’re pretty similar. The hidden Markov model can be explained as a graphical model. It’s a tree structure.

The definition of HMM is given as: The Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated, according to the associated probability distribution. It is only the outcome, not the state visible to an external observer and therefore states are “hidden” to the outside; hence the name Hidden Markov Model. Briefly, the HMM is about finding the probability of the observation to be in a certain state. The difference between Markov Model and HMM is that in HMM the observation is also a probabilistic function of the state.

In a mathematical way, HMM is described as:

$(X_t)\in 1, 2, …, N$, where $t=1, 2, 3, …\\$

$(Y_t)\in 1, 2, …, N$, where $t=1, 2, 3, …$

$P(X_{t+1}=j|X_t=i)=p_{i,j}$  $X_0$ is given.

$P(Y_{t}=j|X_t=i)=q_{i,j}$

In the model, X is hidden, or what we call latent. Ys are observations.

There are three fundamental problems that we can solve using HMMs. The problem is that given $Y_1$,$Y_2$, to $Y_t$, what can we say about $X_1$,$X_2$, to $X_t$? First, we can get the probability distribution of Xt under the condition of all the observations. In this case, it is the inference problem. The second  is that what is the maximum likelihood sequence of $X_1$,$X_2$, …, $X_t$ that maximizes the joining probability of the Xs and Ys. Third, given a sequence of observations of $Y_1$ to $Y_T$, what is the parameters. The parameters are the probability of stating function from  one state at time t to another state at time t+1 in the weather example.

In another word, the problems are:

1. Given the HMM model and the observation sequence, find the probability distribution of the state at a certain time.
2. Given the Hidden Markov Model and the observation sequence, find the optimal state sequence for the underlying Markov process.
3. Given an observation sequence, find the parameters of the Hidden Markov Model that maximizes the probability of the sequence.

# Forward-backward algorithm

The forward-backward algorithm has very important applications to both hidden Markov models (HMMs) and conditional random fields (CRFs). It is a dynamic programming algorithm, and is closely related to the Viterbi algorithm for decoding with HMMs. The forward-backward algorithm is comprised of the forward step and a backward step. We have several statistics:

Forward statistic/probability:

$\alpha_t(i)=P(Y_1, …, Y_t, X_t=i)$

$\hat{\alpha}_t(i)=P(X_t=i | Y_1, …, Y_t, X_t=i)=\frac{\alpha_t(i)}{\sum_i\alpha_t(i)}$

where $\alpha_t$ is the forward parameter, which is the probability for us to get the latent state, when $X_t=i$, conditional on all observations up to time t. $\hat{\alpha}_t$ is the conditional probability of the latent state $X_t$, conditional on previous observations, $Y_1$ to $Y_t$.  $\hat{\alpha}_t$ is obtained by normalizing the $\alpha_t$. According to the Bayesian theorem, there is a recursive relationship between $\alpha_t$ and $\alpha_{t+1}$ which is given by:

$\alpha_{t+1}(i)\propto \sum_j\alpha_t(j)p_{j,i}q_{i,y_{t+1}}$

where $p_{j,i}$ is transition probability that $X_t=j$ and $X_{t+1}=i$. This term is usually called the transition term. $q_{i,y_{t+1}}$ denotes the conditional probability of the observation $y_{t+1}$, condition on $X_{t+1}=i$. The term is usually called the observation term. The scaled version can be obtained in a similar way.

$\hat{\alpha}_{t+1}(i)\propto \sum_j\hat{\alpha}_t(j)p_{j,i}q_{i,y_{t+1}}$ recursion

$c_{t+1}=\sum_i\sum_j\hat{\alpha}_t(j)p_{j,i}q_{i,y_{t+1}}$ scaling factor

Then we can find out the backwards statistic by the same kind of iteration.

Backward statistic:

$\beta_t(i)=P(y_{t+1}, …, y_T | X_t(i))=\sum_jp_{i,j}\cdot \beta_{t+1}(j)\cdot q_{j,y_{t+1}}$

$\hat{\beta}_t(i)=\frac{\beta_t(i)}{\prod _{s=t+1}^T c_s}$

$\hat{\beta}_t(i)=\frac{1}{c_{t+1}}\sum_j p_{i,j}\cdot \hat{\beta}_{t+1}(j) \cdot q_{j,y_{t+1}}$

From the forward statistic and the backward statistic, we can find out the one-slice statistic, which is defined as gamma, and which is the probability of a latent state X at time t, conditional on all observations from $y_1$, to $y_T$.

One-slice statistic:

$\gamma_t(i)=P(X_t=i | y_1, …, y_T)=\hat{\alpha}_t(i)\cdot\hat{\beta}_t(i)$

We can also find the two-slice statistic, which is the probability distribution of $X_t$, $X_{t+1}$, conditional on $Y_1$ to $Y_T$.

Two-slice statistic:

$\xi_t(i,j)=P(X_t=i, X_{t+1}=j | y_1, …, y_T)=\hat{\alpha}_t(i)\cdot p_{i,j} \cdot \hat{\beta}_{t+1}(j) \cdot q_{j,y_{t+1}}$

In addition we can think about the forward-backward algorithm in another way, which is in the forward step we estimate the $\hat{\alpha}_t$ for t from 1 to T, which is the conditional probability of the latent variable, based on observations up to time T. Then in the backward step, we estimate the $\gamma_t$ for t from T down to 1. When we need to estimate $\gamma_t$ for t from T down to 1, one way is that estimating $\gamma_t$ backwards from T to T-1 to T-2, and so on. What we know is the $\gamma_T$ which is the probability of $X_T$, conditional on $Y_1$ to $Y_T$. That one is just the $\hat{\alpha}_T(X_T)$, which is already known. Then we can get the $\gamma_t$ from $\gamma_{t+1}$ using recursion. In this way, $\hat{\beta}_t$ can be treated as the division between $\gamma_t$ and $\hat{\alpha}$.

# Viterbi algorithm

The idea of Viterbi algorithm is actually pretty easy, although the notation is convoluted. It is a dynamical programming algorithm. The idea is that given $Y_1$ to $Y_T$, we are going to find out the sequence of $X_1$ to $X_T$ that maximizes the probability of $X_1, …, X_T$ and $Y_1, …, Y_T$.

Forward step:

$V_{1,k}=1$ for $X_1=k$ and 0 otherwise

$V_{t,k}=P(Y_t=y_t | X_t=k) \cdot max_j(V_{t-1}\cdot p_{j,k})$, maximum probability over all sequence $X_1, …, X_t$ with $X_t=k$

$U_{t,k}=arg~max_j(V_{t-1,j}\cdot p_{j,k})$ the previous state in the above sequence$X_1, …, X_{t-1}=U_{t,k}, X_t=k$

The trick here is recursion. Each time we calculate the $V_{t,k}$, which is the probability of the most probable state sequence responsible for the first $t$ observations that have $k$ as its final state. For example in time t, we have $V_{t,k}$, where $k\in{1,2,… , n}$. Then in time t+1, the $V_{t+1,k}$ can be obtained by checking the probability of all the sequences which end with state k. The probability of the sequence is calculated as follows. If the state in time t is j then the sequence maximize the probability of $X_1, …, X_t$ and $Y_1, …, Y_t$ has the probability $V_{t,j}$. Thus the sequence has state j at time t and state k at time t+1 is $V_{t,j}p_{j,k}$, where $p_{j,k}$ is the transition probability from state j to state k. That is the kind of recursion. After we end up with time T, we just go backwards. We say what is the sequence? What is the previous state of the sequence that ends up with the maximum probability that ends up with.

Backward step:

$x_T=arg~max_k V_{T,k}$

$x_t=U_{t+1,x_{t+1}}$

By this kind of forward-backward algorithm, we can find out the sequence that has the maximum probability, and that has the maximum probability, based on observation. That is the algorithm that involves forward step and the backward step. The notation is pretty convoluted, but the idea is pretty simple.

# Maximum likelihood estimator given latent state sequence

• HMM assumes a sequence of Random Variables to be conditionally independent given a sequence of latent state variables which forms a Markov chain.
• We can start by maximizing the likelihood equation and will conclude that Expected Maximization algorithm is best path for MLE of Hidden Markov models.
• The situation is that we know only the observations and we do not know the Latent States. In this case we yield to EM algorithm. Instead of maximizing the likelihood, we maximize the expected likelihood.

## Maximum likelihood approach

Let’s say we have observed some data

$Y = {(y_1, y_2…y_n)^T}$
we want to fit a likelihood (or posterior) model by maximizing log-likelihood (or posterior) $l(\theta;Y) = log (p(Y|\theta))$

To formalize this problem, we will follow the standard statistical practice of introducing a family of candidate models for consideration. To this end, let $(\Theta, H)$ be a measurable space, called the parameter space. We consider that $\theta \in \Theta$ and the law of the Hidden Markov model $(X_t, Y_t) t >= 0$ defined by $P^\theta$. Our goal is now to select, on the basis of an observed sequence $Y = {(y_1, y_2…y_n)^T}$, a parameter estimate $\theta \in \Theta$ such that the observation statistics are well described by the law $P^\theta$.

What makes for a good parameter estimate $\hat{\theta}$ ? Note that the estimator depends, by definition, on the observed training data $\hat{\theta}$ = $\hat{\theta_n}(y_1, y_2…y_n)$. We would like to guarantee that the estimate $\hat{\theta_n}$ is close to the ‘true’ value of $\theta$ for large n, regardless of what the ‘true’ parameter happens to be. To be precise, we would like to show that  $\hat{\theta_n}$ \to $\theta$ in $P^\theta$ for every $\theta \in \Theta$.

Suppose that we don’t know the explicit form of $p(y|\theta)$, instead we know there are some unobserved (hidden) variable x, and we can write down $p(y|\theta)$ as an integration of the joint probability of y and x, so the process steps in this analysis are as follows:

We have defined Hidden Markov model: $(X_t, Y_t)$ for $t \ge 0$

Also, from previous section: $P(X_1,…,X_T,Y_1,…,Y_T) = \pi_tP(X_t|X_(t−1))P(Y_t|X_t)$

Maximize log likelihood

$l(\theta; Y) = log \sum\limits_{t} p(Y_t, X_t | \theta)$.

$=\sum\limits_{t} log P(X_t|X_(t−1)) + \sum\limits_{t} log P(Y_t|X_t)$

$= \sum\limits_{i} \sum\limits_{j} (log p_{i,j} \sum\limits_{t} (X_t = i, X_{t+1} = j)) + \sum\limits_{i} \sum\limits_{j} (log q_{i,j} \sum\limits_{t} (X_t = i, Y_{t} = j))$

Subject to $\sum\limits_{j}p_{i,j} = 1$ and $\sum\limits_{j}q_{i,j} = 1, \forall i$

$p_{i,j} = \frac{\sum\limits_{t}(X_t = i, X_{t+1} = j)}{\sum\limits_{t}(X_t = i)}$

$q_{i,j} = \frac{\sum\limits_{t}(X_t = i, Y_{t} = j)}{\sum\limits_{t}(X_t = i)}$

Directly maximizing $l(\theta; Y)$ of this form is difficult. Instead of examining through all possible x and maximizing their sum, we are going to use an iterative, greedy searching technique called Expectation-Maximization to maximize the log-likelihood. The algorithm will converge to a local maximum of the likelihood function.

## Expectation-Maximization : Introduction

The EM algorithm is applied for optimization of the likelihood functions in cases where the model is specified with probabilities in terms of an observed and an unobserved (latent) component. HMMs are models of this form, hence HMMs form a class of models for which the EM algorithm can be useful.

In general, if the model consists of two components (X,Y), which we assume take values in a finite space for simplicity, and if the probabilistic model specification consists of the joint point probabilities $P^\theta(X,Y)$, parametrized by $\theta$, then the likelihood when observing only $X=x$ is:

$l(\theta; Y) = log \sum\limits_{t} p(Y_t, X_t | \theta)$

This is not a simplified sum. For HMMs the sum will be over all possible transitions between the hidden states, which quickly becomes a formidable number when the length of the observed sequence grows. Fortunately there are algorithms for HMMs (forward-backward) for fast computation of the likelihood, and the likelihood could then, in principle, be plugged into any general purpose optimization algorithm for maximum-likelihood estimation of $\theta$. The alternative is the EM-algorithm. This is an algorithm that iteratively alternates between:

• The E-step, which is a computation of a conditional expectation given the observed x under the current estimate of $\theta$.
• The M-step, which is a maximization.

The EM-algorithm makes most sense if the two steps above can be implemented in a computationally efficient way, e.g. when we have closed form expressions for the conditional expectation and the maximization.

## Expectation-Maximization : Jensen’s Inequality, Algorithm and Equations

We maximize the expected value of this logarithm. The reason for us to do this is by Jensen’s inequality.

### Jensen’s Inequality

Jensen’s inequality is a general result in convexity. Let $f$ be a function whose domain is the set of real numbers. Recall that $f$ is a convex function if $\frac{d^{2}y}{dx^{2}}\ge 0$ (for all $x \in R$).

If $\frac{d^{2}y}{dx^{2}} > 0$ for all x, then we say $f$ is strictly convex. Jensen’s inequality can then be stated as follows:

Theorem: Let $f$ be a convex function, and let $X$ be a random variable. Then

$E[f(X)] \ge f(E[X])$

Moreover, if $f$ is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e. if X is a constant).

EM Algorithm

Step One: Find a lower-bound of $l(\theta; y)$

First we introduce a density function q(x) called averaging distribution.
A lower-bound of the log-likelihood is given by:

$l(\theta; Y) = log(P(y|\theta))$
$= log(\sum\limits_{x}p(y,x|\theta))$
$= log(\sum\limits_{x}q(x)\frac{p(y,x|\theta)}{q(x)})$
$\ge \sum\limits_{x}q(x)\frac{p(y,x|\theta)}{q(x)}$
$= E_{q(x)}[log(p(y,x|\theta))] + Entropy[q(x)]$
$= L(q, \theta; y)$

The $\ge$ follows from Jensenâs inequality (log-concavity). More explicitly we can decouple $l(\theta; y)$ as the sum of three terms:

$l(\theta; y) = E_{q(x)}[log(p(y,x|\theta))] + KL[q(x) || p(x,y|\theta)] + Entropy[q(x)]$
Will learn about KL (Kullback Leibler) later in series.
The expectation term $E_{q(x)}[log(p(y,x|\theta))]$ is called the-expected-complete-log-likelihood (or Q-function). The equation says that the sum of the Q-function and the entropy of averaging distribution provides a lower-bound of the log-likelihood.

Step Two: Maximize the bound over $\theta$ and q(x) iteratively.

Look at the bound $L(q, \theta; y)$. The equality is reached only at $q(x) = p(x|y, \theta)$, and the entropy term is independent of $\theta$. So we have：

E-step:  $q^t = {argmax}_q \; L(q, \theta^{t-1};y) = p(x|y;\theta^{t-1})$

M-step: $\theta^t = {argmax}_{\theta} \; L(q, \theta^{t};y) = \;{argmax}_{\theta} \; E_{q^t(x)}[log(p(y,x|\theta))]$

# HMM algorithm covered so far

The inference problems that we often involve:

• probability of an observed sequence $P(Y_1,…,Y_T)$
• filtering $P(X_t|Y_1,…,Y_t)$
• smoothing $P(X_t|Y_1,…,Y_T)$
• best explanation ${argmax}_{X_1,…,X_T} P(X_1,…,X_T,Y_1,…,Y_T)$
• model selection

Caveats:

• EM doesn’t guarantee the asymptotic properties of MLE — consistency and asymptotic normality.
• Event rates and activity durations are often function of time
• Events could happen between observations
• Between HMM, if we have the latent variable then we can have maximum likelihood estimation of the parameters. If we do not, then all we can expect is that our estimation is the lower bound. The real values of the parameters can only make it better. We get some conservative estimation about the performance of our estimator.

# Discrete-time Gaussian process

From Wikipedia: In a Gaussian process, every point in some input space is associated with a normally distributed random variable. Moreover, every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions.

A Gaussian process is fully specified by its mean function $m(x)$ and covariance function $k(x,x’)$. This is a natural generalization of the Gaussian distribution whose mean and covariance is a vector and matrix, respectively. The Gaussian distribution is over vectors, whereas the Gaussian process is over functions. We can write:

$f \sim GP(m, k)$,

meaning: “the function f is distributed as a GP with mean function m and covariance function k”.

The individual random variables in a vector from a Gaussian distribution are indexed by their position in the vector. For the Gaussian process it is the argument x (of the random function f(x)) which plays the role of index set: for every input x there is an associated random variable f(x), which is the value of the (stochastic) function f at that location.

Latent State : $x_t$

Observed State : $y_t$

The point is to estimate latent state $x_t for t=1,2,…$from observations $y_t$

given the following:

• $latex x_t = F_t x_{t-1} + B_t u_t + w_t$, where $latex w_t\sim {\cal N}(\vec 0, Q_t)$
• $latex z_t = H_t x_t + v_t$, where $latex v_t\sim {\cal N}(\vec 0, R_t)$

Procedure to follow:

• Iterate over t = 1,2….
• Calculate probability distribution of $X_t$ from probability distribution of $X_{t−1}$
• Calculate joint distribution of $(X_t Z_t)$ and update $X_t$ as conditional distribution over $Z_t=z_t$
• This can be classified as Multivariate Normal Distribution, which is given by:
• $p(x;\mu,\sigma) = \; (2\pi)^{-k/2} |\Sigma|^{-1/2} exp(-1/2 (x – \mu)^T \Sigma^{-1} (x – \mu))$

# Multivariate Normal Distribution- Linear Transformation

Suppose we have x which is a random variable that belongs to the Gaussian distribution with mean $\mu$ and variance matrix $\Sigma$. Now, linear transformation is applied on x to create a new random variable $y=Ax+b$. We find the mean of $y$ by using the fact that $E$ is a linear operator.
$\bar{y}=E{(y)}=E{(Ax+b)}=AE{(x)}+b=A\bar{x}+b=A\mu +b$
Then we find covariance:

$Covariance=E{((y-\bar{y})(y-\bar{y})^T)}$
$=E{[(Ax+b)-â(A\bar{x}+b)][(Ax+b)-â(A\bar{x}+b)]^T}$
$=E\{{[A(xâ-\bar{x})][A(xâ-\bar{x})]^T}\}$
$=E\{{A(xâ-\bar{x})(xâ-\bar{x})^TA^T}\}$
$=AE\{{(xâ-\bar{x})(xâ-\bar{x})^T}\}A^T$
$=A \Sigma A^T$

(Linear transformation here involves first multiplying the matrix x by A, and transforming this matrix, so it involves rotation and scaling. Then we apply this Ax to another constant, b which involves shifting.That is $Ax+b$ involves first distorting X by stretching and rotation and then offset it by b.)

Thus we see that the resultant random variable also belongs to a Gaussian distribution with mean A$\mu$ +b and variance A$\Sigma$ $A^T$.

# Multivariate Normal Distribution – Conditional Probability

Given
$\left( \begin{array}{c} X_1 \\ X_2 \end{array} \right) \sim{\cal N}\left(\left(\begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right),\left(\begin{array}{c} \Sigma_{11}\quad \Sigma_{12} \\ \Sigma_{21}\quad \Sigma_{22} \end{array} \right)\right)$

The conditional distribution of $X_1$ given $X_2$ is:

$X_1 | X_2 = x_2\sim{\cal N}(\mu_1 + \Sigma_{12} \Sigma_{22}^{-1} ( x_2 – \mu_2), \Sigma_1-\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})$

Proof:

First we will take a look at the Schur’s complement which will be essential to find out the conditional distribution of $X_1$ given $X_2$.

## Schur complement

We need to find inverse of the covariance matrix $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)$.

In order to do that, we first use the 2nd column $\left(\matrix{\Sigma_{12}\cr\Sigma_{22}}\right)$ to Gaussian-eliminate $\Sigma_{21}$, which is equivalent to right multiply the covariance matrix with $\left(\matrix{I & 0 \cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$. We then use the resulting 2nd row $\left(\matrix{0 & \Sigma_{22}}\right)$ to eliminate $\Sigma_{12}$, which is equivalent to left multiply the covariance matrix with $\left(\matrix{ I & \Sigma_{12}\Sigma_{22}^{-1} \cr 0 & I}\right)$. We perform the following operation:

$\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right) \left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$
$= \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr & I}\right) \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12}\cr 0 & \Sigma_{22}}\right)$
$= \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right)$

If we move $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$ and $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)$ to the right hand side of the equation above, then $\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right)$ and the resulting $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)^{-1}$ to the left hand side of the equation, we get the following:
$\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)^{-1}$
$= \left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right) \left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right) \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$

## Application of Schur complement to find the conditional pdf of a multivariate normal distribution

Let $x = \left(\matrix{X_1\cr X_2}\right)\sim \mathcal{N}\left(\left(\matrix{\mu_1\cr\mu_2}\right), \left(\matrix{\Sigma_{11} &\Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right) \right)$ be multivariate normal distribution, where $X_1$ and $X_2$ are vectors of dimension $p$ and $q$ respectively, and we want to find the conditional distribution of $X_1$ on $X_2=x_2$.
The probability density function of $X$ is
$f\left(\matrix{x_{1}\cr x_{2}}\right) \propto \exp\left(-\frac{1}{2} \left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right) \left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1} \left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)$

Then:

$\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1}$

$=\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right) \left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right) \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$
Hence $X_{1}|X_{2}=x_{2}\sim\mathcal{N}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$

# Kalman Filter

A Kalman filter is a recursive estimator, a weighted average of the previous estimate (propagated to the present) and the new information, where the weights are optimized to minimize the squared error.

In layman’s term: We had many measurements in the past that are all packed neatly into the previous estimate. Now we have a new measurement. The output of the Kalman filter is a value somewhere in the middle, based on the validity of the previous estimate (how much data, its quality, and how dynamic is the model), and the accuracy of the new information.

## The process

The Kalman filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements. As such, the equations for the Kalman filter fall into two groups: time update equations and measurement update equations. The time update equations are responsible for projecting forward (in time) the current state and error covariance estimates to obtain the a priori estimates for the next time step. The measurement update equations are responsible for the feedback i.e. for incorporating a new measurement into the a priori estimate to obtain an improved a posteriori  estimate.

The time update equations can also be thought of as predictor equations, while the measurement update equations can be thought of as corrector equations. Indeed the final estimation algorithm resembles that of a predictor-corrector algorithm for solving numerical problems.

The Kalman filter model assumes that the state of a system at a time t evolved from the prior state at a time t-1 according to the equation:

$X_t = F_t X_{t-â1} + B_t u_t + w_t$

It means that each $x_k$  (our signal values) may be evaluated by using a linear stochastic equation (the first one). Any $x_k$ is a linear combination of its previous value plus a control signal uk and a process noise (which may be hard to conceptualize).

• $X_t$ is the state vector containing the terms of interest for the system (e.g., position, velocity, heading) at time t
• $u_t$ is the vector containing any control inputs (steering angle, throttle setting, braking force)
• $F_t$ is the state transition matrix which applies the effect of each system state parameter at time t-1 on the system state at time t (e.g., the position and velocity at time t-1 both affect the position at time t)
• $B_t$ is the control input matrix which applies the effect of each control input parameter in the vector ut on the state vector (e.g., applies the effect of the throttle setting on the system velocity and position)
• $w_t$ is the vector containing the process noise terms for each parameter in the state vector. The process noise is assumed to be drawn from a zero mean multivariate normal distribution with covariance given by the covariance matrix $Q_t$.

We are basically adding the action state in addition to the data to determine our position. This allows us to incorporate a moving state.

If we are pretty sure that the system into consideration fits into the above model, then the next step is to determine the necessary parameters and the initial estimates. As already mentioned, there are two distinct set of equations: Time update(prediction) and Measurement Update(correction).

For the prediction we basically find out what is the distribution of $X_t=F_t X_{t-1}+B_t u_t+w_t$, where $w_t\sim {\cal N}(0,Q_t)$, given that
$X_{t-1}\sim{\cal N}(\hat x_{t-1|t-1}, P_{t-1|t-1})$?

$X_{t}\sim {\cal N}(F_t \hat x_{t-1|t-1}+ B_t u_t, F_t P_{t-1|t-1}F^{\rm T}_t+Q_t)$

The standard Kalman filter equations for the prediction stage are:

predicted (a priori) state$\hat{x}_{t|t-1}=F_t\hat{x}_{t-1|t-1}+B_t u_t$

predicted (a priori) covariance $P_{t|t-1}=F_tP_{t-1|t-1}F_t^T+Q_t$

where $Q_t$ is the process noise covariance matrix associated with noisy control inputs.

To start the process, we need to know $x_0$,$P_0$. For the measurement update equations, we find out what is posterior distribution of $X_t$ given that the prior distribution is $X_{t}\sim {\cal N}(\hat x_{t|t-1}, P_{t|t-1})$, and we have made an observation $Z_{t} = H_t X_t + v_t=z_t$, where $v_t\sim {\cal N}(0,R_t)$

Innovation $\epsilon_t = z_t-H_t\hat x_{t|t-1}$
Innovation covariance $S_t = H_t P_{t|t-1} H_t^{\rm T} + R_t$
Optimal Kalman gain $K_t = P_{t|t-1}H_t^{\rm T}S_t^{-1}$
Updated (a posterior) state estimate $\hat x_{t|t} = \hat x_{t|t-1} + K_t\epsilon_t$
Updated (a posterior) covariate $P_{t|t} = P_{t|t-1}-K_t H_t P_{t|t-1}$