Factorial Hidden Markov Model [09/29]

Prepared by Alireza Farasat, Simranjot Singh, Devesh Mulmule.

The Hidden Markov model (HMM) is one of the basic statistical tools for modeling discrete time series. In an HMM, information about the past is conveyed through a single discrete variable—the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner.
An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable—the hidden state—which can take on one of K discrete values. This multinomial assumption supports an efficient parameter estimation algorithm—the Baum-Welch algorithm—which considers each of the K settings of the hidden state at each time step. However, the multinomial assumption also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need K = 230 distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary state variables.In the following section we define the probabilistic model for factorial HMMs.

Factorial Hidden Markov Models — Generative Description

We begin by describing the Hidden Markov model, in which a sequence of observations
$\left\{Y_t\right\}$ where t = 1,…T, is modeled by specifying a probabilistic relation between the observations and a sequence of hidden states $\left\{X_t\right\}$, and a Markov transition structure linking the hidden states. The model assumes two sets of conditional independence relations:

  • that $Y_t$ is independent of all other observations and states given $X_t$,
  • and that $X_t$ is independent of $x_1…x_t−2$ given $S_t – 1$ (the first-order Markov property).


Factorial HMM looks like like presented in the above. Each x in the figure is a latent variable, is represented by a chain indexed by time. So we see that individual chances for each x represented with a superscript evolve as a Markov Chain. Each x could take discreet states like success, failure etc. At each time stamp we have observation that summarizes over the x over all chains. The goals is to make inference about individual chance of given the state transition matrix over all x and the observations. X is a hence a matrix of latent states where columns indexed by chains and rows indexed by time. Y is the observer.
Each Markov chain evolve independent of each other with each x taking one of the discreet states, d. Observation Y is hence a Gaussian distribution around linear combination of all X. Therefore the total number of states that X can take is $d^M$ (M: number of Markov chains).
Using these independence relations, the joint probability for the sequence of states and observations can be factored as.
P(\left\{X_t, Y_t\right\}) = P(X_1)P(Y_1|X_1)\prod_{t=2}^{T}P(X_t|X_t−1)P(Y_t|X_t)
In factorial HMMs the underlying state transitions are
constrained. A natural structure to consider is one in which each state variable evolves according to its own dynamics, and is a priori uncoupled from the other state variables:

P(\left\{X_t| X_{t-1}\right\})= \prod_{m=1}^{M}P(X^{(m)}_t |X^{(m)}_{t−1})

In a factorial HMM the observation at time step t can depend on all the state variables at that time step. As mentioned before for continuous observations, one simple form for this dependence is linear Gaussian; that is, the observation $Y_t$ is a Gaussian random vector whose mean is a linear function of the state variables. We represent the state variables as K × 1 vectors, where each of the K discrete values corresponds to a 1 in one position and 0 elsewhere. The resulting probability density for a $D × 1$ observation vector $Y_t$ is

&&P(Y_t| X_t)= |C|^{−1/2}(2\pi)^{−D/2}exp \left\{\frac{-1}{2}(Y_t − µ_t)’C^{−1} (Y_t − µ_t)\right\},\\
\mbox{where} && \mu_t = \sum_{m=1}^{M}W^{(m)}X^{(m)}
Each $W^{(m)}$ matrix is a $D\times K$ matrix whose columns are the contributions to the means for each of the settings of $X^{(m)}_t$ , C is the D × D covariance matrix.

The joint log likelihood for the sequence of states and observations hence can be represented as

&& \log P(\left\{X_t, Y_t\right\}) = -\frac{1}{2}\sum_{t=1}^{T}log P(Y_t|X_t) + \sum_{t=1}^{T}\sum_{m=1}^{M}log P(X^{(m)}_t|X^{(m)}_{t-1}) −\frac{T}{2}log|C| -\frac{d.T}{2}log(2\pi),\\
\mbox{where} && \log P(Y_t|X_t) = Y^T_tC^{−1}Y_t – 2\sum_{m=1}^{M}Y^T_tC^{−1}W^{(m)}X^m_t + \sum_{m=1}^{M}\sum_{n=1}^{M}tr(W^{(m)T}C^{-1}W^{(n)}X^{(n)}_tX^{(m)T}_t),\\
\mbox{and} && \log P(X^{(m)}_t|X^{(m)}_{t-1}) = logP^{(m)}X^{(m)}_tX^{(m)T}_{t-1}

where $tr(J)$ represents the trace of matrix J.

Trace Basics

Let A be $n \times n$ matrix. Then its trace is given by $Tr(A) = \sum_i A_{ii}$.

Some properties/lemma of Trace:

  • $Tr(A+B) = Tr(A) + Tr(B)$
  • $Tr(c.A) = c.Tr(A)$
  • $Tr(A^{T}) = Tr(A)$ —diagonal elements are same in $A^{T}$
  • $Tr(A.B) = Tr(B.A)$

\mbox{Tr}(A.B) & = & a_{11}.b_{11}+a_{12}.b_{21}+\cdots+a_{1n}.b_{n1}\\
& & +a_{21}.b_{12}+a_{22}.b_{22}+\cdots+a_{2n}.b_{n2}\\
& & +\cdots\\
& & +a_{n1}.b_{1n}+a_{n2}.b_{2n}…+a_{nn}.b_{nn}\\
\mbox{Tr}(B.A) & = & b_{11}.a_{11}+b_{12}.a_{21}…+b_{1n}.a_{n1}\\
& & +\cdots\\
& & +b_{n1}.a_{1n}+b_{n2}.a_{2n}…+b_{nn}.a_{nn}

Hence, $Tr(A.B) = Tr(B.A)$

  • $Tr(X^{T}Y) = Tr((X^{T}Y)^{T}) = Tr(Y^{T}X)$ —since $Tr(A^{T}) = Tr(A)$
  • $Tr(XY^{T}) = Tr((XY^{T})^{T}) = Tr(YX^{T})$
  • $Tr(X^{T}Y) = Tr(YX^{T}) = Tr(Y^{T}X) = Tr(XY^{T}) = \sum_{ij}X_{ij}Y_{ij}$, which is the dot product of two matrices.
  • Jacobi’s formula for determinant of matrix is: $\frac {d|A|}{dA^{-1}} = A$

Exact Inference of Factorial Hidden Markov Models

Doing exact inference is on Factorial HMM is pretty intractable, especially with a large number of chains where the number of states start exploding. If we do not observe Y in HMM then the chains of X evolve independent of each other. However if we make an observation, then the chains can be seen related as well. This can be explained by the Berkison Paradox — If we restrict ourselves to the cases where events A or B occur where least one of the events A or B occurs, the knowledge that B has occurred makes it less likely that A has occurred or
$P(A | B, A or B) < P(A | A or B)$ where $0 < P(A), P(B) < 1$ (which means that the events have to be interesting enough that $A/B$ do not occur with certainty or never).

Mathematically, it follows that, given $X^{(m)}$ and $X^{(n)}$ for $m≠n$ are independent, then $X^{(m)}_t$ and $X{(n)}_t$ for $m\neq n$ become dependent given observation $Y^{(m)}_t$ about them.
Hence exact inference involves enumerating all $(X^{(1)},X^{(2)},…,X^{(M)})$ — become computationally intractable when M is large.
Suppose a university the admission committee recruits brainy or sporty candidates. A student can either be brainy or sporty or both or neither, however the probability of either is independent. Now, we see that the college normally recruits brainy students and not sporty students. It follows that given that a student is brainy, he is less likely to be sporty. This provides that although the chances X in Factorial Hidden Markov Model are uncorrelated (i.e they evolve independently), but after we perform observation over all the states then the chances can be seen as correlated. Hence we have to estimate joint state of all of these chances together (condition and observation) which leads to an exploding number of states. Say each chain can take 2 states and we have 50 chains then the resultant number here will be $2^{50}$.

Naive Mean Field Approximation

From the exact inference (and Berkson paradox) on factorial Hidden Markov Model, if we do not observe $Y_t$ (at time slice t), then the latent states(chains at t) $X_t^{1}, X_t^{2}…. X_t^{M}$ will evolve independent of each other. But if $Y_t$ is available, then chains are correlated; and hence the number of states will explode. As an example, if a state can take upto K values and M chains are present, then the factorial HMM would require $K^M$ states (huge space!), given that observation is available. So, when $Y_t$ is available, then for two latent states $X_t^{m}$, $X_t^{n}$ we can say, $P(X_t^{m}, X_t^{n}/Y_t) \neq P(X_t^{m}|Y_t) \cdot P(X_t^{n}|Y_t) $. Since, we cannot cope with this joint probability distribution, we need a simpler distribution.

The naive mean field approximation is given as distribution Q which approximates the actual distribution of  $X_t^{m}$’s with parameters $\theta_t$ for each of the chains m. We can express this approximation as: $Q(X_t/\theta) = \prod_t \prod_m Q(X_t^{m}/\theta^{m})$. In other words, we assume that under $Q$ the distribution of $X_t^{m}$ with parameter $\theta_t^{m}$,is independent of other $X_t^{n}$ where $m \neq n$ and for each chain m, $X_t^{m}$ is independent of $Y_t$. By this assumption, we get rid of all the dependencies in factorial HMM between the chains and ii)between states and observation. Thus, the complexity of the model is highly reduced.

This distribution Q belongs to the exponential family; Given that a state can take up to $K$ values, then for a chain m, this approximate distribution can be written as: $Q(X_t^{m}/\theta) = \prod_k (\theta_{t,k}^{m})^{X_{t,k}^{m}} = exp(X_t^{(m)T}. log\theta_t^{m})$, where $\sum_k \theta_{t.k}^{m} = 1$.
Here, the $X_t^{(m)T}$ is a unit row vector $(X_{t,1}^{(m)}, X_{t,2}^{(m)}, .. X_{t,K}^{(m)})$ where only one value would be 1 and the rest would be zero, since at time t $X_t^{m}$ can exist in only of the K states. Thus, at the end, we will only have one $log\theta_{t,k}^{m}$ value.

The ultimate goal is to determine the parameters $\theta_t^{m}$ for a chain m at time t by minimizing the KL divergence. For computing this KL divergence, some statistics for this distribution Q are defined as follows:

  • Expected value of Q : $E_Q X_t^{m} = \theta_t^{m}$
  • Variance of $X_{t-1}^{m}$ and $X_{t}^{m}$: $E_Q X_{t-1}^{m} X_{t}^{(m)T} = \theta_{t-1}^{m}\theta_t^{(m)T}$
  • Co-Variance of $X_{t-1}^{m}$ and $X_{t}^{m}$:
    • $E_Q X_t^{m} X_t^{(n)T} = \theta_t^{m} \theta_t^{(n)T}$—for $m \neq n$
    • $= diag(\theta_t^{m})$—-for $m=n$

(where diag() is an operator that takes a vector and returns a square matrix with the elements of the vector along its diagonal, and zeros everywhere else.)

Now we know that log likelihood of actual distribution P is given as:

\log P(X_t,Y_t) &=& -1/2 \sum_{t=1}^T (Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{m} X_t^{m}\\
&&+\sum_{t=1}^{T} \sum_{m=1}^{M} tr(\log P^{m} X_t^{m} X_{t-1}^{(m)T})-(T/2) \log |C|-(d.T/2) \log (2.\pi)

The KL divergence is the amount of information lost by expressing the actual distribution ‘P’ as the approximate distribution ‘Q’. In terms of entropy it can defined as: $KL(Q||P) = E_Q \log (Q/P)= E_Q \log Q -E_Q \log P$. From the definition of $Q(X_t^{m}|\theta)$, $E_Q \log Q = \sum_t \sum_m \theta_t^{(m)T} \log \theta_t^{m}$.

From the variance and covariance statistics of $E_Q$ from equation (2) and (3), we can write $E_Q log P$ as:

\[E_Q log P = -1/2 \sum_{t=1}^T (Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{m} \theta_t^{m} +\sum_{m!=n} tr(W^{(m)T}C^{−1}W^{n}\theta_t^{n}theta_t^{(m)T}) \]\[+\sum_{n=1}^{M} tr(W^{(m)T}C^{−1}W^{m} diag (theta_t^{(m)T})))\]\[+\sum_{t=1}^{T} \sum_{m=1}^{M} tr(log P^{m} \theta_t^{m} \theta_{t-1}^{(m)T})-(T/2) log |C|-(d.T/2) log (2.\pi) \]

Here we split out the second term in $E_Q log P$ which is \[\sum_{m=1}^{M}\sum_{n=1}^{M}tr(W^{(m)T}C^{−1}W^{n}X_t^{n}X_t^{(m)T})\] into two terms considering the co-variance statistics of distribution ‘Q’.\\

Thus we have,

KL(Q||P) &=& \sum_t \sum_m \theta_t^{(m)T} \log \theta_t^{m} +1/2 \sum_{t=1}^T (Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{m} \theta_t^{m}\\
&& +\sum_{m\ne n}tr(W^{(m)T}C^{-1}W^{n}\theta_{t}^{n}\theta_{t}^{(m)T})+\sum_{n=1}^{M}tr(W^{(m)T}C^{-1}W^{m}diag(\theta^{m}_{t}))\\
&& -\sum_{t=1}^{T} \sum_{m=1}^{M} tr(log P^{m} \theta_t^{m} \theta_{t-1}^{(m)T})+(T/2) log |C|+(d.T/2) log (2.\pi)

Differentiating the above KL divergence w.r.t $\theta_t^{m}$ and equating to 0 we get,
$\log \theta_t^{m}-W^{(m)T}C^{-1}Y_t +\sum_n W^{(m)T}C^{-1}W^{n}\theta_t^{n} +1/2 diag(W^{(m)T}C^{-1}W^{m})-log P^{m}\theta_{t+1}^{m}-log P^{(m)T}\theta_{t-1}^{m}
+const = 0$


$\theta_t^{m} \propto \exp (W^{(m)T}C^{-1}Y_t -\sum_n W^{(m)T}C^{-1}W^{n}\theta_t^{n} -1/2 diag(W^{(m)T}C^{-1}W^{m})+log P^{m}\theta_{t+1}^{m}+\log P^{(m)T}\theta_{t-1}^{m})$

Then we use fixed point algorithm to get the values of $\theta$, wherein we iterate over the values of $\theta_t^{m}$ until they converge.

Structured Approximation

Mean field approximation presented in the previous section factors the posterior probability such that all the state variables are statistically independent. In contrast to this rather extreme factorization there exists a more realistic approximation that is both tractable and preserves much of the probabilistic structure of the original system. In this scheme the factorial HMM is approximated by M uncoupled HMMs as shown in Figure 2.


Maximizing Log-Likelihood of Variational Distribution

As described in previous chapters, efficient and exact inference is implemented using the forward–backward algorithm within each HMM. Note that the mean field approximation presented in the previous section does not specify the form of distribution. Hence, a structural approximation should be employed using more structural variational distribution $Q$. It is expected that $Q$ provides a lower bound on the log likelihood for obtaining a learning algorithm. The only dependency assumed in this method is the dependency of $X_t^{(m)}$ on $X_{t-}^{(m)}$ and the rest of random variables are assumed to be independent. Therefore, the structured variational approximation is defined as follows:


where $Z_Q$ is a normalization constant guaranteeing that $Q$ is a valid probability distribution. Probability of $X_t^{(m)}$ given parameter $\theta$ and $X_{t-}^{(m)}$ is defined as:

$Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)=\prod_{k=1}^{K}\left(h_{t,k}^{(m)}\sum_{j=1}^{K}X_{t-1,j}^{(m)}\cdot P_{j,k}^{(m)}\right)^{X_{t,k}^{(m)}}=\exp\left(\left(\log\vec{h}_{t}^{(m)}+\vec{X}_{t-1}^{(m)}\cdot\log P^{(m)}\right)^{\mbox{T}}\cdot\vec{X}_{t}^{(m)}\right),$

where $\vec{h}_{t}^{(m)} = \{h_{t,1}^{(m)}, h_{t,2}^{(m)},\ldots, h_{t,K}^{(m)}\}$ is a vector indicating the set of parameters corresponding to $\vec{X}_{t}^{(m)}$ (note that $h_{t,k}^{(m)}$ is the parameter related to $X_{t-1,k}^{(m)}$ recalling that $X_t^{(m)}$ can take on $K$ values). In addition, $P_{j,k}^{(m)}$ is the transition probabilities of $X_t^{(m)}$ from state $j$ to state $k$. It is easy to show that $Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)$ follows an exponential family distribution (please see \cite{bishop2006pattern} chapter 2).
To characterize $Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)$, lets define the following moments (expected values of $X_{t}^{(m)}$, $X_{t-1}^{(m)}X_{t}^{(m){\rm T}}$ and $X_{t}^{(m)}X_{t}^{(n){\rm T}}$:

– $\mathbf{E}_{Q}X_{t-1}^{(m)}X_{t}^{(m){\rm T}}=\xi_{t-1}^{(m)}$
– $\mathbf{E}_{Q}X_{t}^{(m)}X_{t}^{(n){\rm T}}=\begin{cases}
\gamma_{t}^{(m)}\gamma_{t}^{(n)\mbox{T}} & m\ne n\\
\mbox{diag}\left(\gamma_{t}^{(m)}\right) & m=n

By factorizing $Q(\{X_{t}\}|\theta)$, one can rewrite the log likelihood of the variational distribution as follows:
$-\log Q(\{X_{t}\}|\theta)=-\sum_{t}\sum_{m}\log\vec{h}_{t}^{(m)\mbox{T}}\cdot\vec{X}_{t}^{(m)}-\sum_{t}\sum_{m}\vec{X}_{t-1}^{(m)\mbox{T}}\cdot\log P_{j,k}^{(m)}\cdot\vec{X}_{t}^{(m)}$

Note that maximizing log-likelihood function is equivalent to minimizing -1$\times$log-likelihood function.

Minimizing KL-Divergence of Variational Distribution

Maximizing log-likelihood function is equivalent to minimizing KL-divergence. Assuming $P=P(\{X_{t}\}|\theta)$ and $Q=Q(\{X_{t}\}|\theta)$ represents true and variational distribution of $X_{t}$. KL-divergence is defined as follows:
&&D_{KL}(Q\|P)=\mathbf{E}_{Q}\log Q-\mathbf{E}_{Q}\log P=\sum_{t}\sum_{m}\log\vec{h}_{t}^{(m)\mbox{T}}\cdot\vec{X}_{t}^{(m)}\\
&&+\frac{1}{2}\sum_{t=1}^{T}\left(Y_{t}^{\mbox{T}}C^{-1}Y_{t}-2\sum_{m=1}^{M}Y_{t}^{\mbox{T}}C^{-1}W^{(m)}\gamma_{t}^{m}+\sum_{m\ne n}\mbox{tr}\left(W^{(m)\mbox{T}}C^{-1}W^{(n)}\gamma_{t}^{(n)}\gamma_{t}^{(m)\mbox{T}}\right)+\sum_{m=1}^{M}\mbox{tr}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\mbox{diag}\left(\gamma_{t}^{(m)}\right)\right)\right)\\
&&-\log Z_Q+{T\over 2}\log |C| + {d\cdot T\over 2}\log (2\pi)\\
In order to minimize the KL-divergence, one need to take derivative respect to the parameters $h_{t}^{(n)}$. This also can be using EM algorithm. In this case, it is more convenient to take derivative respect to $\log h_{t}^{(n)}$ (Note that the solution to the KL-divergence remains the same. This can be verified using chain rule).
&&\frac{\partial D_{KL}(Q\|P)}{\partial\log h_{\tau}^{(n)}}=\gamma_{\tau}^{(n)}+\sum_{t=1}^{T}\sum_{m=1}^{M}\underbrace{\left(\log h_{t}^{(m)}-W^{(m)\mbox{T}}C^{-1}Y_{t}+\sum_{l\ne m}W^{(m)\mbox{T}}C^{-1}W^{(l)}\gamma_{t}^{(l)}+\frac{1}{2}\mbox{diag}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\right)\right)}_{\frac{\partial}{\partial\gamma_{t}^{(m)}}}\frac{\partial\gamma_{t}^{(m)}}{\partial\log h_{\tau}^{(n)}}\\
Solving the obtained equations results in the exponential family form of $\log h_{t}^{(n)}$.
&&\Rightarrow\log h_{t}^{(m)}\propto\exp\left(W^{(m)\mbox{T}}C^{-1}Y_{t}-\sum_{l\ne m}W^{(m)\mbox{T}}C^{-1}W^{(l)}\gamma_{t}^{(l)}-\frac{1}{2}\mbox{diag}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\right)\right)
Indeed, The parameter $h_{t}^{(m)}$ obtained from these fixed point equations is the observation probability associated with state variable $X_{t}^{(m)}$ in hidden Markov model $m$. Using these probabilities, the forward–backward algorithm is used to compute a new set of expectations for $X_{t}^{(m)}$.\\
The next step is maximization after finding the expected value of latent factors given the parameters. Indeed, minimizing KL-divergence to estimate the latent state includes finding out the expected value of latent states followed by, given the observations what values of parameters maximize the likelihood. How can one estimate the parameters? To estimate the parameters, one need to use maximum likelihood or maximum expected likelihood approach. In this case there exist two sets of parameters $W$ and $C^{-1}$ (Note that both $W$ and $C^{-1}$ are matrices). as mentioned before, $W$ is used to calculate the expected value of $Y_t|X_t$. If we take derivative respect to matrix $W$ and set the equation equals to zero, the result is:
&&\frac{\partial P(\{X_{t},Y_{t}\})}{\partial W^{(m)}}=\sum_{t}\sum_{m}C^{-1}Y_{t}X_{t}^{(m)\mbox{T}}+C^{-1}W^{(n)}X_{t}^{(n)}X_{t}^{(m)\mbox{T}}\stackrel{\mbox{set}}{=}0\\&\Rightarrow&
&&\mbox{where }X_{t}=\left(\begin{array}{c}

Similarly, precision matrix of the observation is estimated as follows:
&\frac{\partial P(\{X_{t},Y_{t}\})}{\partial P^{(m)}}=\sum_{t=1}^{T}X_{t-1}^{(m)}X_{t}^{(m)\mbox{T}}\\\Rightarrow&P_{i,j}^{(m)}=\frac{\sum_{t}X_{t-1,i}^{(m)}X_{t,j}^{(m)}}{\sum_{t}X_{t-1,i}^{(m)}}



Factorial Hidden Markov Model [09/29]

Leave a Reply