# Projects & Project ideas

In this post I collect together projects and ideas from the class and from myself. My criteria are that a project (a) is a good complement to the content of this course, (b) can contribute to your career development, (c) is well defined and can be well managed, and (d) can be finished in 4 weeks at 5 hours per week.

• Review data sets that track the locations and proximity of people with mobile phones and other types of sensors: Where can those data sets be downloaded? What are the works on each data set? What are the interesting findings, innovative methodology and significant applications of these data sets? The sum of the base-10 logarithms of the data sizes (in bytes) should exceed 100, with more than 50 references ordered from most cited to least cited. Tell a cohesive story about these data sets.
• Review social media data sets: Where can those data sets be downloaded? What are the works on each data set? What are the interesting findings, innovative methodology and significant applications of these data sets? The sum of the base-10 logarithms of the data sizes (in bytes) should exceed 100, with more than 50 references ordered from most cited to least cited. Tell a cohesive story about these data sets.
• Review one of the open source simulation software packages: Demonstrate how to use the software by walking through one example, show the architecture of the software (modules and their interactions, entry points, key modules, etc.), and show how to dump events or system events from the software. Candidates include UrbanSim, MATSim, TranSIMS, MetroPolis, STEM, etc.
• Define the dynamics of network evolution through a set of events, simulate the network evolution, and make inferences from observations using either the variational method or MCMC. Hint: we can follow the examples in the problem sets and work with fewer events; there is no need to work with real-world data.
• Define the dynamics of diffusion through a set of events, simulate the diffusion, and make inferences from observations using either the variational method or MCMC. Hint: we can follow the examples in the problem sets and work with fewer events; there is no need to work with real-world data.
• Identify a real-world data set, postulate the dynamics, and conduct inference using one of the algorithms discussed in the lectures.
• more to come….

# Gibbs Sampling and Metropolis-Hastings Sampling [10/06]

Prepared by Aayush Uppal, Madhur Gupta and Nishchita Purushothama Patel

### Gibbs Sampling

#### Introduction to Monte Carlo Sampling

Gibbs sampling is a sampling technique, so the first fundamental question is: why sampling?

Gibbs sampling allows us to sample from a distribution that asymptotically follows the posterior $P(\pi|X)$ without having to explicitly calculate the integrals. Gibbs sampling is an instance of a Markov chain Monte Carlo technique. Monte Carlo methods are algorithms that help you obtain a desired value by performing simulations involving probabilistic choices.

Let us understand the Monte Carlo technique with a simple example: estimating the value of $\pi$.

Draw a perfect square on the ground. Inscribe a circle in it i.e. the circle and the square are centered in exactly the same place, and the circle’s diameter has length identical to the side of the square. Now take a bag of rice, and scatter the grains uniformly at random inside the square. Finally, count the total number of grains of rice inside the circle (call that C), and inside the square (call that S). Therefore we can say that: $\frac{C}{S} \approx \frac{\pi(\frac{d}{2})^2}{d^2}$. Solving for $\pi$, we get $\pi \approx \frac{4C}{S}$

We just solved a problem by approximating the values of integrals. The more grains of rice we use, the better our approximation will be.
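The rice-grain experiment is easy to reproduce numerically. A minimal sketch (the function name `estimate_pi` and the sample sizes are ours, for illustration):

```python
import random

def estimate_pi(n_grains, seed=0):
    """Scatter points uniformly in the unit square and count those
    falling inside the inscribed circle of radius 1/2."""
    rng = random.Random(seed)
    in_circle = 0
    for _ in range(n_grains):
        x, y = rng.random(), rng.random()
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            in_circle += 1
    # C/S ~ pi*(d/2)^2 / d^2, so pi ~ 4*C/S
    return 4 * in_circle / n_grains

print(estimate_pi(1_000_000))  # close to 3.1416
```

Increasing `n_grains` tightens the estimate, exactly as adding more grains of rice would.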

Understanding the Markov Chains aspect

In the above example the distribution was uniform. Now assume the distribution is not uniform, and that we are interested in $E_{p(z)} [f(z)]$. The expected value is represented as: $E[f(z)] = \int f(z)p(z)dz$.

Conceptually, the integral in the equation sums up f(z)p(z) over infinitely many values of z. Rather than touching each point in the sum exactly once, if you sample N points $z^{(1)}, z^{(2)}, \dots, z^{(N)}$ at random from the probability density p(z), then

$E_{p(z)}[f(z)] = \lim_{N\to\infty} \frac{1}{N}\sum \limits_{n=1}^{N} f(z^{(n)})$

Looking at this equation, it’s clear that we can get an approximate value by sampling only a finite number of times, T:

$E_{p(z)}[f(z)] \approx \frac{1}{T}\sum \limits_{t=1}^{T} f(z^{(t)})$

So now we have a way to approximate the integral. The remaining problem is this: how do we sample $z^{(0)}, z^{(1)}, z^{(2)}, \cdots , z^{(T)}$ according to $p(z)$? For our purposes, the key idea is to think of the z's as points in a state space, and to find ways to "walk around" the space (going from $z^{(0)}$ to $z^{(1)}$ to $z^{(2)}$ and so forth) so that the likelihood of visiting any point z is proportional to p(z). The walk is driven by a function g that makes a probabilistic choice about what state to go to next according to an explicit or implicit transition probability $P_{trans}(z^{(t+1)} | z^{(0)}, z^{(1)}, \cdots , z^{(t)})$.

What makes it a Markov chain Monte Carlo technique is defining things so that the next state you visit, $z^{(t+1)}$, depends only on the current state $z^{(t)}$, i.e.:

$P_{trans}(z^{(t+1)}|z^{(0)}, z^{(1)},\cdots,z^{(t)}) = P_{trans}(z^{(t+1)}|z^{(t)})$

The heart of Markov Chain Monte Carlo methods is designing g so that the probability of visiting a state z will turn out to be p(z), as desired. Gibbs sampling is one algorithm that meets these conditions.

#### The Gibbs sampling algorithm

Gibbs sampling is applicable in situations where $Z$ has at least two dimensions, i.e. each point $z$ is really $z = \langle z_1 , \dots , z_k \rangle$, with $k > 1$. The basic idea in Gibbs sampling is that, rather than probabilistically picking the next state all at once, you make a separate probabilistic choice for each of the $k$ dimensions, where each choice depends on the other $k-1$ dimensions.
$z^{(0)} := \langle z_1^{(0)},\dots,z_k^{(0)}\rangle$
for t = 0 to T-1 do
for i = 1 to k do
$z_i^{(t+1)} \sim P( Z_i | z_1^{(t+1)},\dots,z_{i-1}^{(t+1)}, z_{i+1}^{(t)},\dots,z_{k}^{(t)})$
end for
end for

The distribution we are sampling from, using the definition of conditional probability, is:

$$P( Z_i | z_1^{(t+1)},\dots,z_{i-1}^{(t+1)}, z_{i+1}^{(t)},\dots,z_{k}^{(t)}) = \frac{P(z_1^{(t+1)},\dots,z_{i-1}^{(t+1)}, Z_i, z_{i+1}^{(t)},\dots,z_{k}^{(t)})}{P(z_1^{(t+1)},\dots,z_{i-1}^{(t+1)}, z_{i+1}^{(t)},\dots,z_{k}^{(t)})}$$
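As a concrete instance of the nested loop above, the following sketch runs a Gibbs sampler for a standard bivariate normal with correlation $\rho$, a textbook case where each full conditional is itself a one-dimensional normal; the example and its parameters are ours, not from the lecture:

```python
import random

def gibbs_bivariate_normal(rho, T=10_000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each full conditional is z_i | z_j ~ N(rho * z_j, 1 - rho^2)."""
    rng = random.Random(seed)
    z1, z2 = 0.0, 0.0
    sd = (1.0 - rho ** 2) ** 0.5          # std dev of each conditional
    samples = []
    for _ in range(T):
        z1 = rng.gauss(rho * z2, sd)      # update dimension 1 given z2
        z2 = rng.gauss(rho * z1, sd)      # update dimension 2 given the new z1
        samples.append((z1, z2))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
```

The empirical correlation of the draws approaches 0.8 as $T$ grows, even though each update only ever samples a one-dimensional conditional.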

#### Gibbs Sampler Stationary Distribution

The Gibbs sampler has been used extensively in the statistics literature. It relies on iteratively sampling from a set of compatible conditional distributions, and the sampler is known to converge to a stationary distribution. The Markov chain generated by the Gibbs sampler converges to the target density as the number of iterations becomes large. $P(X)$ is the stationary distribution of the Gibbs sampler, i.e. $P(\phi) = \int P(\phi| X)P(X)dX = \int P(X, \phi) dX$. The reason is the following.

\begin{align}
\int P(\phi| X)P(X)dX = & \int d X_1 d X_2 P(\phi_1|X_2) P(\phi_2|\phi_1)P(X_1,X_2) \\
= & P(\phi_2|\phi_1) \int d X_2 P(\phi_1|X_2) \int d X_1 P(X_1,X_2) \\
= & P(\phi_2|\phi_1) \int d X_2 P(\phi_1|X_2) P(X_2) \\
= & P(\phi_2|\phi_1) P(\phi_1) \\
= & P(\phi) \\
\end{align}

#### Detailed Balance

(Principle of detailed balance) Transition probability matrix P and probability distribution $\pi$ are said to be in detailed balance, or equivalently, the principle of detailed balance is said to hold, if: $\pi_i p_{ij} = \pi_j p_{ji}\ \forall i,j \in S$

Ultimately, if the detailed balance equations are satisfied, the Markov chain is reversible and has an invariant distribution. The Gibbs sampler has detailed balance if it

• scans through the components in a fixed order, then
• scans through the components in reverse order

\begin{align}
P(X)P(\phi|X) = & P(X) \int d X'_1 P(X'_1 |X_2) P(\phi_2|X'_1) P(\phi_1|\phi_2)\\
= & P(X) P(\phi_1|\phi_2) \int d X'_1 P(X'_1 |X_2) P(\phi_2|X'_1)\\
= & P(X_1|X_2)P(\phi_1|\phi_2) \int d X'_1 P(X'_1) P(X_2|X'_1) P(\phi_2|X'_1)\\
= & P(\phi)P(X|\phi) {\rm~by~symmetry}
\end{align}

The Gibbs algorithm is a special case of the Metropolis-Hastings algorithm; that is, Metropolis-Hastings is a generalization for sampling. The Metropolis sampling algorithm can draw samples from the distribution of a complex random variable $X$; $X$ could be a stochastic process, e.g. the formation of a random network model or a whole social diffusion process.

### Metropolis-Hastings Sampling

In order to work with the Metropolis-Hastings algorithm, our only requirement is that we are able to write down $P(X)$ up to some scaling factor.

The Metropolis algorithm first proposes a possible new state $X^{*}$ in the Markov chain, based on a previous state $X^{(t-1)}$, according to the proposal distribution $q(X^{*} | X^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the target distribution $P(X)$ evaluated at $X^{*}$.

One constraint of the Metropolis sampler is that the proposal distribution $q(X^{*} | X^{(t-1)})$ must be symmetric. The constraint originates from using a Markov chain to draw samples: a necessary condition for drawing from a Markov chain's stationary distribution is that at any given point in time t, the probability of being at $X^{(t-1)}$ and moving to $X^{(t)}$ must be equal to the probability of being at $X^{(t)}$ and moving to $X^{(t-1)}$, a condition known as reversibility or detailed balance.

The Metropolis-Hastings algorithm implements an additional correction factor $c$, defined from the proposal distribution as
$$c = \frac{q(X^{(t-1)} | X^{*})}{q(X^{*} | X^{(t-1)})}$$

The correction factor adjusts the transition operator to ensure that the probability of moving from $X^{(t-1)}$ to $X^{*}$ is equal to the probability of moving from $X^{*}$ to $X^{(t-1)}$, no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of acceptance probability $\alpha$. Specifically, to draw $M$ samples using the Metropolis-Hastings sampler:

1. set $t = 0$, generate an initial state $X^{(0)} \sim \pi^{(0)}$
2. repeat until $t = M$
3. set $t = t+1$
4. generate a proposal state $X^{*}$ from $q(X | X^{(t-1)})$
5. calculate the proposal correction factor $c = \frac{q(X^{(t-1)} | X^{*})}{q(X^{*} | X^{(t-1)})}$
6. calculate the acceptance probability $\alpha = \min \left(1,\frac{P(X^{*})}{P(X^{(t-1)})}\times c \right )$
7. draw a random number $u$ from $Unif(0,1)$
    1. if $u \leq \alpha$, accept the proposal state $X^{*}$ and set $X^{(t)} = X^{*}$
    2. else set $X^{(t)} = X^{(t-1)}$

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.

#### Metropolis-Hastings Sampling — Example

```r
# Metropolis-Hastings sampler for a Gamma(shape, rate) target, using a
# normal random-walk proposal (symmetric, so the correction factor c = 1)
mh.gamma <- function(n.sims, start, burnin, cand.sd, shape, rate) {
  theta.update <- function(theta.cur, shape, rate) {
    theta.can <- rnorm(1, mean = theta.cur, sd = cand.sd)   # propose
    # target density ratio; proposals below 0 have density 0, hence rejection
    accept.prob <- dgamma(theta.can, shape = shape, rate = rate) /
      dgamma(theta.cur, shape = shape, rate = rate)
    if (runif(1) <= accept.prob) theta.can else theta.cur
  }
  draws <- numeric(n.sims)
  draws[1] <- start
  for (i in 2:n.sims)
    draws[i] <- theta.update(draws[i - 1], shape = shape, rate = rate)
  draws[(burnin + 1):n.sims]                                # discard burn-in
}

mh.draws <- mh.gamma(10000, start = 1, burnin = 1000, cand.sd = 2,
                     shape = 2, rate = 4)
plot(density(mh.draws))
lines(seq(0, 2, length.out = 100),
      dgamma(seq(0, 2, length.out = 100), shape = 2, rate = 4), col = 'red')
legend('topright', lty = 1, col = c('black', 'red'),
       legend = c('M-H', 'gamma(shape=2,rate=4)'))
```


#### Detailed Balance of Metropolis-Hastings algorithm

$P(\phi|X) = q(X,\phi)\alpha(X, \phi)$, if $X\ne\phi$

$P(X|X) = 1-\int d\phi q(X,\phi)\alpha(X,\phi)$

Hence
\begin{align}
& P({Y\le\phi}|X) = \int_{-\infty}^\phi dY q(X,Y)\alpha(X,Y) + 1(\phi\ge X)\left(1-\int dY q(X,Y)\alpha(X,Y)\right)\\
\Rightarrow & p(X, \phi) = {d P({Y\le\phi}|X) \over d\phi} = q(X,\phi)\alpha(X,\phi) + \delta(X-\phi)\left(1- \int dY q(X,Y)\alpha(X,Y)\right)\\
\Rightarrow & P(X)p(X,\phi) = P(X) q(X,\phi) \min \left(1,{P(\phi) q(\phi, X) \over P(X) q(X, \phi)} \right) \\
& +\delta(X-\phi)\left(P(X)-\int dY P(X) q(X,Y) \min \left(1,{P(Y) q(Y, X) \over P(X) q(X, Y)} \right) \right)\\
\Rightarrow & P(X)p(X,\phi) = \min \left(P(X) q(X,\phi),P(\phi) q(\phi, X) \right)
+\delta(X-\phi)\left(P(X)-\int dY \min \left(P(X) q(X,Y), P(Y) q(Y, X) \right) \right)\\
& =P(\phi)p(\phi, X) {\rm~by~symmetry}
\end{align}

### Gibbs Sampling from Hidden Markov Models

A hidden Markov model is defined in terms of the following three components: latent states $X_t\in \{1,\cdots,k\}$; a state transition kernel $p_{ij} = P(X_{t+1}=j|X_t=i)$ together with the initial state $X_0$; and an observation model $P(Y_t|X_t)$.

One simple way to sample from the hidden Markov model is to iterate through every variable $X_t$ for $t = 1,\cdots,T$. Here we fix everything else and write down the probability of $X_t$ as given below:

\begin{align*} & P(X_t | X_1,\cdots,X_{t-1},X_{t+1},\cdots,X_T,Y_1,\cdots,Y_T)\\ =& \frac{P(X_1,\cdots,X_T,Y_1,\cdots,Y_T)}{\sum_{X_t} P(X_1,\cdots,X_T,Y_1,\cdots,Y_T)} \end{align*}

However, in order to make an efficient sampling algorithm, it has to satisfy the following two conditions:

• Sample points should diffuse quickly inside the sample space.
• The equilibrium distribution of the sample should be very similar to the distribution that is desired.

However, the problem with the above sampling algorithm is that the points diffuse slowly, because we fix the previous and next values at every state. This means that although the algorithm works, it takes a very long time to mix well enough to explore the whole probability space. To avoid this, we use the following algorithm:

Forward Filtering Backward Sampling Algorithm

1. We first conduct the forward sweep and extract the forward statistic $\alpha_t(X_t)= P(X_t|Y_1,\cdots,Y_t)$ for $t=1,2,\cdots,T$
2. The next step is the backward sampling stage. Here, we sample $X_T$ first, then $X_{T-1}$, and so on down to $X_0$

\begin{align*} & P(X_{t-1}|X_t,Y_1,\cdots,Y_T)\\ =& \frac{P(X_{t-1},X_t,Y_1,\cdots,Y_T)}{P(X_t,Y_1,\cdots,Y_T)}\\ =& \frac{P(X_{t-1},Y_1,\cdots,Y_{t-1})P(X_t|X_{t-1})P(Y_t,\cdots,Y_T|X_t)}{\sum_{X_{t-1}} P(X_{t-1},Y_1,\cdots,Y_{t-1})P(X_t|X_{t-1})P(Y_t,\cdots,Y_T|X_t)}\\ =& \frac{\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})P(Y_t|X_t)\beta_t(X_t)}{\sum_{X_{t-1}}\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})P(Y_t|X_t)\beta_t(X_t)}\\ =& \frac{\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})}{\sum_{X_{t-1}}\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})} \end{align*}
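The two stages can be sketched for a discrete-state HMM as follows; the function `ffbs` and the toy parameters are illustrative assumptions, not from the lecture:

```python
import random

def ffbs(pi0, P, emit, Y, seed=0):
    """Forward filtering backward sampling for a discrete-state HMM.
    pi0[i]: initial probability of state i; P[i][j]: transition probability;
    emit[i][y]: observation likelihood P(Y=y | X=i); Y: observed symbols."""
    rng = random.Random(seed)
    K, T = len(pi0), len(Y)

    # Forward sweep: alpha[t][i] = P(X_t = i | Y_1, ..., Y_t)
    alpha = []
    for t in range(T):
        if t == 0:
            a = [pi0[i] * emit[i][Y[0]] for i in range(K)]
        else:
            a = [emit[i][Y[t]] * sum(alpha[-1][j] * P[j][i] for j in range(K))
                 for i in range(K)]
        z = sum(a)
        alpha.append([ai / z for ai in a])

    def draw(w):
        """Draw an index with probability proportional to the weights w."""
        u, c = rng.random() * sum(w), 0.0
        for i, wi in enumerate(w):
            c += wi
            if u <= c:
                return i
        return len(w) - 1

    # Backward sampling: X_T ~ alpha_T, then
    # P(X_{t-1} | X_t, Y) proportional to alpha_{t-1}(x) * P(X_t | X_{t-1}=x)
    X = [0] * T
    X[T - 1] = draw(alpha[T - 1])
    for t in range(T - 1, 0, -1):
        X[t - 1] = draw([alpha[t - 1][i] * P[i][X[t]] for i in range(K)])
    return X

# toy two-state chain with nearly deterministic emissions
X = ffbs(pi0=[0.5, 0.5], P=[[0.9, 0.1], [0.1, 0.9]],
         emit=[[0.99, 0.01], [0.01, 0.99]], Y=[0, 0, 1, 1])
```

Because the backward pass conditions each $X_{t-1}$ on the already-sampled $X_t$, the whole path is drawn in one sweep, avoiding the slow mixing of the single-site sampler.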

# Sampling Multivariates and Processes [10/08]

Prepared by Zhu Jin, Srinivasan Rajappa & Yuhui Wu

### Stochastic Kinetic Model

The stochastic kinetic model is a highly multivariate jump process used to model chemical reaction networks, particularly those in biochemical and cellular systems.

This kind of model represents the temporal evolution of a system of M populations interacting with one another through a set of $v$ events.

• The $v$ events happen with rates $h_k(x_t,c_k)$, where $k=1,\dots,v$; $c_k$ is the rate constant ($k$ is the index of the event) and $x_t$ represents the M populations
• Each event $k$ changes the populations by $\Delta_k$

The Gillespie algorithm is as follows:

1. At time $t=0$, the system is initialized into the initial state:
$(x^{(1)},\cdots,x^{(M)})\leftarrow(x^{(1)}_0,\cdots,x^{(M)}_0)$.
2. Sample the time to the next event, no matter what event it is:
$\tau\sim{\rm Exponential}(h_0(x,c))$, where $h_0(x,c) = \sum_{k=1}^v h_k(x,c_k)$.
3. Sample the specific event, on condition that an event is happening (the minimum of a set of exponential random variables is exponential with rate equal to the sum of the individual event rates; conditioned on an event happening at that minimum time, the identity of the event follows a categorical distribution):
$k\sim{\rm Categorical}(h_1/h_0,\dots,h_v/h_0)$.
4. Update the time and the states of the populations: $t\leftarrow t+\tau$, $x\leftarrow x+\Delta_k$.
5. Repeat steps 2-4 until the termination condition is satisfied.

The reason this algorithm can sample the time to the next event, take the minimum, and then move on is that the exponential distribution is memoryless: at the time of the next event, the process carries no memory of what was done in the past.

Models of this kind involve several populations interacting through a set of events. The events happen with different event rates, each involving its own rate constant, and the rates also involve the current state of the system, which is expressed in terms of the different populations.

Susceptible-infectious-susceptible dynamics:

• $S+I→2I, rate = \alpha\cdot |S|\cdot |I|$
• $I→S, rate = \beta\cdot |I|$

For example, in SIS dynamics there are two populations, the susceptible population and the infectious population, which interact with each other through two events, infection and recovery. At the individual level, infection happens at rate $\alpha$ when an individual from the susceptible population interacts with one from the infectious population. Since the system has $|S|$ susceptible individuals and $|I|$ infectious individuals, the total number of combinations obtained by choosing one susceptible and one infectious individual is $|S|\cdot|I|$, so the total infection rate of the whole system is $\alpha$ times the susceptible population times the infectious population. It depends on how we look at it: at the individual level the rate is $\alpha$, but at the system level the rate is larger than $\alpha$. Similarly, there is another event called recovery, and the same reasoning applies.

The state changes can be read off the chemical equations. After an infection event the susceptible population decreases by one and the infectious population increases by one: before the reaction there is one $I$ and after it there are two $I$s, so $I$ increases by one; before the reaction there is one $S$ and after it there are zero $S$, so $S$ decreases by one. After a recovery event the infectious population decreases by one and the susceptible population increases by one. The rates are defined per time unit and may be constants or functions of time.
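The Gillespie algorithm applied to these SIS dynamics can be sketched as follows (the function name and parameter values are illustrative choices of ours):

```python
import random

def gillespie_sis(S0, I0, alpha, beta, T, seed=0):
    """Gillespie simulation of susceptible-infectious-susceptible dynamics.
    Events: infection S+I -> 2I at rate alpha*S*I; recovery I -> S at rate beta*I."""
    rng = random.Random(seed)
    t, S, I = 0.0, S0, I0
    path = [(t, S, I)]
    while True:
        h_inf, h_rec = alpha * S * I, beta * I
        h0 = h_inf + h_rec              # total event rate
        if h0 == 0 or t >= T:           # no infectious individuals left, or time up
            break
        t += rng.expovariate(h0)        # time to next event ~ Exponential(h0)
        if t >= T:
            break
        if rng.random() < h_inf / h0:   # categorical choice between the two events
            S, I = S - 1, I + 1         # infection: S decreases, I increases
        else:
            S, I = S + 1, I - 1         # recovery: I decreases, S increases
        path.append((t, S, I))
    return path

path = gillespie_sis(S0=99, I0=1, alpha=0.002, beta=0.1, T=100.0)
```

Note that the total population $|S|+|I|$ is conserved by both events, which is a useful invariant for checking the simulation.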

Many sampling-based algorithms have been mentioned, but conditioning on observations poses a challenge: the sampling does not work as easily, because the sampling algorithms do not assume that there are observations. Thus, the likelihood of a finite state continuous time Markov process with complete observation is introduced.

### Sample rate constants from complete path stochastic kinetic dynamics

In a system observed from time $0$ to time $T$, $n$ events have happened: $v_1,v_2,\dots,v_n$, at corresponding times $t_1,t_2,\dots,t_n$. Each event $v_i$ happened with rate computed from the state at the time of the previous event as

$$h_{v_i}(x_{t_{i−1}},c_{v_i})$$

and the total rate of all events is

$$h_0(x_{t_{i-1}},c)=\sum_k h_k(x_{t_{i-1}},c_k)$$

changing state from $x_0$ at time $t_0=0$ to $x_1,x_2,…,x_n.$

The state of the system between $t_{i-1}$ and $t_i$ is $x_{t_{i-1}}$, and during that interval the rate for any event to happen is $h_0$, the sum of the rates of the individual events. The probability of this set of events and the corresponding happening times involves the probability density that the time to the next event is $t_i - t_{i-1}$.

So the probability density function is

$$h_0(x_{t_{i-1}},c)\exp(-h_0(x_{t_{i-1}},c)\cdot(t_i-t_{i-1}))$$

There is also a categorical distribution: supposing that an event happened, which event was it? Its probability is:

$$\frac{h_{v_i}(x_{t_{i-1}},c_{v_i})}{h_0(x_{t_{i-1}},c)}$$

After the $n$th event, between the time of the $n$th event and $T$, there is no further event; the probability of this is

$$\exp(-h_0(x_{t_n},c)\cdot(T-t_n))$$

Those are the three parts of the likelihood function. The $h_0$ in the pdf of the time to the next event cancels with the $h_0$ in the categorical distribution, leaving all of the $h_{v_i}$:

$$\prod_{i} h_{v_i}(x_{t_{i−1}},c_{v_i})$$

And merging the exponential distributions gives the total hazard from time $0$ to time $T$:

$$\exp(-\sum_i h_0(x_{t_i},c)\cdot(t_{i+1}-t_i))$$

So the likelihood contains information about the jumps ($\prod_{i} h_{v_i}(x_{t_{i-1}},c_{v_i})$) and also the non-jumps ($\exp(-\sum_i h_0(x_{t_i},c)\cdot(t_{i+1}-t_i))$).

If $$h_k(x_t,c_k)=c_kg_k(x_t),$$

then
$$L(c;x)=\prod_i c_{v_i}g_{v_i}(x_{t_{i-1}})\exp\left(-\sum_i\sum_k c_k g_k(x_{t_i})\cdot(t_{i+1}-t_i)\right)\propto\prod_{k\in{\rm events}}c^{r_k}_k \exp\left(-\sum_{k\in{\rm events}}c_k\int dt\, g_k(x_t)\right)$$ where $$r_k=\sum_i 1(v_i=k)$$

Hence the MLE of $c_k$, the expected number of occurrences of event $k$ per unit of integrated $g_k$, is

$$\hat c_k=\frac{r_k}{\int dt\, g_k(x_t)}$$
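As a sanity check of this formula, here is a sketch that computes $\hat c_k$ from a completely observed path; the record format and the toy pure-death example are illustrative assumptions of ours:

```python
def mle_rate(record, is_event_k, g_k, T):
    """MLE of a rate constant from a completely observed sample path.
    record: list of (t_i, x_i, event_i), with t_0 = 0 and event_0 = None;
    x_i is the state right after time t_i. is_event_k tests for event type k,
    g_k(x) is the factor in h_k(x, c_k) = c_k * g_k(x), T ends the window.
    Returns r_k / integral_0^T g_k(x_t) dt, with x_t piecewise constant."""
    r_k = sum(1 for _, _, e in record if e is not None and is_event_k(e))
    integral = 0.0
    for (t0, x0, _), (t1, _, _) in zip(record, record[1:]):
        integral += g_k(x0) * (t1 - t0)
    t_last, x_last, _ = record[-1]
    integral += g_k(x_last) * (T - t_last)  # no event after the last one
    return r_k / integral

# toy pure-death process x -> x - 1 with g(x) = x
record = [(0.0, 3, None), (0.4, 2, "death"), (1.1, 1, "death")]
c_hat = mle_rate(record, lambda e: e == "death", lambda x: x, T=2.0)
print(c_hat)  # 2 events / (3*0.4 + 2*0.7 + 1*0.9) = 2 / 3.5
```

The numerator counts the jumps of type $k$, and the denominator accumulates $g_k$ over the piecewise-constant state path, matching the two parts of the likelihood above.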

### Sample from a proposal process with piecewise constant/linear event rate

An inhomogeneous or non-homogeneous Poisson process is similar to an ordinary Poisson process, except that the average rate of arrivals $\lambda(t)$ varies over time; in other words, the rate parameter of the process is a function of time.

In the example of the predator-prey model, the population of predator/prey observed over time can exhibit a rate similar to $\lambda(t)$. In such situations the random phenomenon can be modeled as an inhomogeneous Poisson process.

Property of independent increments: Suppose that we have a random process $X=\{X_t: t \in T\}$ such that the index set $T$ is either $N$ or $[0,\infty)$. Here $X$ has independent increments if for $t_1, t_2, \cdots, t_n \in T$ with $t_1\le t_2 \le \cdots \le t_n$, the increments $X_{t_1}, X_{t_2}-X_{t_1}, \cdots , X_{t_n}-X_{t_{n-1}}$ are all independent.

A process that produces random points is a non-homogeneous Poisson process with rate $\lambda(t)$ if the counting process $N$ satisfies the following properties:

1. If $\{A_i:i \in I \}$ is a countable, disjoint collection of measurable subsets of $[0,\infty)$ then $\{N(A_i): i \in I\}$ is a collection of independent random variables.
2. If $A \subseteq[0,\infty)$ is measurable then $N(A)$ has the Poisson distribution with parameter $m(A)=\int_A \lambda(t)\,dt$.

Some other properties include the following: for the number of events in $(t,t+dt]$ we can apply the same properties of the Poisson process with rate $\lambda(t)$, i.e.
$N_t(dt) \sim \mbox{Poisson}(\lambda(t)\, dt)$

This property gives an insight that sum of independent Poisson random variables is a Poisson with rate equaling sum of individual rates at any time $t$.

Now, having said this, a homogeneous Poisson process, in contrast, requires event observations to be close to one another; if they are not, the rate cannot be assumed to be a constant. Applying the same notion to a non-homogeneous Poisson process, we may have several plausible paths between the event rate at $t_1$ and the event rate at $t_2$, where a path refers to the variable rate that may be assumed in between.

This brings us to the objective of finding the best fit for a non-homogeneous Poisson process, estimating the variable rate between two major observations, which engenders the requirement of sampling the random variable. The technique adopted is as follows:

1. First, samples are taken from a homogeneous process running at the average rate.
2. Then the times are adjusted by stretching, thereby accounting for an accurate depiction of the variable rate involved.

The sampling is done according to a uniform distribution, and the time stretching exploits the fact that the sum of Poisson random variables is again a Poisson random variable. This makes a sound basis for asserting the samples on a path between two observations.

This helps us understand the inhomogeneous process in terms of the homogeneous Poisson process. However, we still have to understand it mathematically in terms of the $h$ function.

Assume $Y(t)$ is an inhomogeneous Poisson process with $\lambda(t) = (1-t)\cdot h_0 + t\cdot h_1$ for $t \in (0,1)$, and assume $X(s)$ is a homogeneous Poisson process with rate $(h_0 + h_1)/2$, where $s$ is the time index of the homogeneous process.

On solving using the cumulative hazard function:

$M(s) = -\log P(T_X > s) = s\cdot (h_0+h_1)/2$
and $\delta(t) = h_0 \cdot t + t^2 (h_1-h_0)/2$.

Setting $\delta(t) = M(s)$ and solving gives $t = \left(\sqrt{h_0^2 + (h_1^2 - h_0^2)\, s} - h_0\right)/(h_1 - h_0)$.

Thus we can infer the following:

$X$, with its event times transformed by $s \mapsto \left(\sqrt{h_0^2 + (h_1^2 - h_0^2)\, s} - h_0\right)/(h_1 - h_0)$, has the distribution of $Y(t)$.

The derivation above connects the two stochastic processes path by path, under complete observation, and lets us propose probable rate paths from our sampling method. With an accept-reject filter we can retain paths with the correct probability, and this forms the basis for the next sample. Thus we are able to accept a sample for the event rates between two different observations within a time period.
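The time-stretching construction can be sketched directly: sample a homogeneous path at the average rate and map each event time through the time change derived above. The function below is an illustrative sketch of ours and assumes $h_1 \ne h_0$:

```python
import random

def sample_inhomogeneous(h0, h1, seed=0):
    """Sample event times of an inhomogeneous Poisson process on (0, 1) with
    rate lambda(t) = (1-t)*h0 + t*h1, via a homogeneous process X with rate
    (h0+h1)/2 and the time change t = (sqrt(h0^2+(h1^2-h0^2)*s)-h0)/(h1-h0).
    Assumes h1 != h0."""
    rng = random.Random(seed)
    avg = (h0 + h1) / 2.0
    times = []
    s = 0.0
    while True:
        s += rng.expovariate(avg)   # homogeneous event times on (0, 1)
        if s >= 1.0:
            break
        # stretch time so the cumulative hazards match: delta(t) = M(s)
        t = ((h0 ** 2 + (h1 ** 2 - h0 ** 2) * s) ** 0.5 - h0) / (h1 - h0)
        times.append(t)
    return times
```

Since the time change is increasing and maps $s\in(0,1)$ onto $t\in(0,1)$, the transformed event times stay ordered and inside the interval.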

### Metropolis-Hastings Sampling from Discrete Time Observations of Finite State Continuous Time Markov Process

Sample from finite state continuous time Markov process and discrete observations:

• $X_t\in{1,\dots,k}$
• State transition kernel $q_{i,j}=\lim_{\Delta t\to 0} \frac{P(X_{t+\Delta t}=j|X_t=i)-1(i\equiv j)}{\Delta t}$ and initial state $X_0$
• Observation model $P(Y_t|X_t)$
• Observations $Y_0, Y_1, \dots, Y_T$

Plan:

• Initialise the sample path to be consistent with the observed data, and repeat the following:
• Sample rate constants from their full conditionals given the current sample path.
• For each of the T intervals, propose a new sample path consistent with its end-points and accept/reject it with a Metropolis-Hastings step
• Output the current rate constants.

#### Sample between Observations Assuming Homogeneous Event Rates

Now we introduce the concept of a homogeneous Poisson process. In its basic form, a counting process $\{N(t),t\ge0\}$ is defined as a Poisson process with rate $\lambda$ if:

• $N(0)=0$
• Independent increments (the numbers of occurrences counted in disjoint intervals are independent of each other)
• for any $s,t\ge0$, $P\{N(t+s)-N(s)=n\} = {(\lambda t)^n e^{-\lambda t}\over n!}, n=0,1,2, …$

Consider a homogeneous Poisson process with rate $\lambda$ on the finite interval $[0,T]$. Then the number of events happening in this interval has Poisson distribution, $R\sim{\rm Poisson}(\lambda T)$. In addition, the event times conditioned on $R$ have uniform distribution $U(0,T)$.
For any given $\lambda,p>0$, let $N\sim{\rm Poisson}(\lambda)$ and $X|N\sim{\rm Binomial}(N,p)$. Then $X\sim{\rm Poisson}(\lambda p)$
$P(X=k) = \sum_{i=k}^\infty P(X=k|N=i)P(N=i)\\ = \sum_{i=k}^\infty {i!\over k! (i-k)!} p^k {(1-p)}^{i-k}\times {\lambda^i e^{-\lambda}\over i!}\\ = \dots = {{(\lambda p)}^ke^{-\lambda p}\over k!}$
The Poisson process's sample path can be drawn as a step function of the event count over time:

$T_{n}, n=1,2,3,\dots$ is the time of the $n$th event, with $T_{0}=0$. $X_{n}, n=1,2,3,\dots$ is the time interval between the $(n-1)$th event and the $n$th event.
Given $N(t)=n$, the $n$ jumps $T_1,\cdots,T_n$ are uniformly distributed on interval $(0,t)$.
First, consider the situation when $n=1$, meaning the event happens exactly once in the time interval $[0,t]$; for $s\le t$,
$\mbox{P}(T_{1}\le s|N(t)=1)=\frac{\mbox{P}(N(s)=1)\mbox{P}(N(t)-N(s)=0)}{\mbox{P}(N(t)=1)}=\frac{\lambda se^{-\lambda s}e^{-\lambda (t-s)}}{\lambda te^{-\lambda t}}=\frac{s}{t}$
which indicates that the event happens uniformly for $n=1$. Now we consider the situation for $n\ge2$:
$\mbox{P}(t_{i}<T_{i}\le t_{i}+h_{i},i=1,\cdots,n|N(t)=n)\\$
$=\frac{\mbox{P}(\mbox{exactly 1 event in }(t_{i},t_{i}+h_{i}],i=1,\cdots,n,\mbox{ no events elsewhere in }(0,t))}{\mbox{P}(N(t)=n)}\\$
$= \frac{\lambda h_{1}\exp(-\lambda h_{1})\cdots\lambda h_{n}\exp\left(-\lambda h_{n}\right)\cdot\exp\left(-\lambda(t-h_{1}-\cdots-h_{n})\right)}{\exp\left(-\lambda t\right)(\lambda t)^{n}/n!}\\ = \frac{n!}{t^{n}}h_{1}\cdot h_{2}\cdot\dots\cdot h_{n}.$
Taking derivatives over the intervals $h_1,\cdots,h_n$, we get the desired result.
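The two facts above give a direct way to sample a homogeneous Poisson path on $[0,T]$: draw $R\sim{\rm Poisson}(\lambda T)$, then draw $R$ uniform event times and sort them. A sketch (the helper draws the Poisson count with Knuth's multiplication method, an implementation choice of ours):

```python
import math
import random

def poisson_path(lam, T, seed=0):
    """Sample a homogeneous Poisson process on [0, T]:
    R ~ Poisson(lam*T) events, with event times i.i.d. Uniform(0, T), sorted."""
    rng = random.Random(seed)
    # Knuth's method for drawing R ~ Poisson(lam*T)
    L, R, p = math.exp(-lam * T), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            break
        R += 1
    return sorted(rng.uniform(0, T) for _ in range(R))

times = poisson_path(lam=2.0, T=5.0)
```

This is exactly the construction used when proposing new sample paths between observed end-points: the count comes from the Poisson law, and the event times fill the interval uniformly.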

# Factorial Hidden Markov Model [09/29]

Prepared by Alireza Farasat, Simranjot Singh, Devesh Mulmule.

The Hidden Markov model (HMM) is one of the basic statistical tools for modeling discrete time series. In an HMM, information about the past is conveyed through a single discrete variable, the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner.

An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable (the hidden state) which can take on one of K discrete values. This multinomial assumption supports an efficient parameter estimation algorithm, the Baum-Welch algorithm, which considers each of the K settings of the hidden state at each time step. However, the multinomial assumption also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need $K = 2^{30}$ distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary state variables. In the following section we define the probabilistic model for factorial HMMs.

### Factorial Hidden Markov Models — Generative Description

We begin by describing the Hidden Markov model, in which a sequence of observations
$\left\{Y_t\right\}$ where t = 1,…T, is modeled by specifying a probabilistic relation between the observations and a sequence of hidden states $\left\{X_t\right\}$, and a Markov transition structure linking the hidden states. The model assumes two sets of conditional independence relations:

• that $Y_t$ is independent of all other observations and states given $X_t$,
• and that $X_t$ is independent of $X_1,\dots,X_{t-2}$ given $X_{t-1}$ (the first-order Markov property).

A factorial HMM is structured as follows. Each latent variable $x$ is represented by a chain indexed by time, so the individual chains (each $x$ marked with a superscript) evolve as Markov chains. Each $x$ can take discrete states such as success, failure, etc. At each time stamp we have an observation that summarizes the $x$ over all chains. The goal is to make inferences about the individual chains given the state transition matrices over all $x$ and the observations. $X$ is hence a matrix of latent states, with columns indexed by chains and rows indexed by time, and $Y$ is the observation.
Each Markov chain evolves independently of the others, with each $x$ taking one of $d$ discrete states. The observation $Y$ is hence a Gaussian distribution around a linear combination of all the $x$. Therefore the total number of states that $X$ can take is $d^M$ (M: number of Markov chains).
Using these independence relations, the joint probability for the sequence of states and observations can be factored as:

$$P(\left\{X_t, Y_t\right\}) = P(X_1)P(Y_1|X_1)\prod_{t=2}^{T}P(X_t|X_{t-1})P(Y_t|X_t)$$

In factorial HMMs the underlying state transitions are
constrained. A natural structure to consider is one in which each state variable evolves according to its own dynamics, and is a priori uncoupled from the other state variables:

$$P(X_t| X_{t-1})= \prod_{m=1}^{M}P(X^{(m)}_t |X^{(m)}_{t-1})$$

In a factorial HMM the observation at time step t can depend on all the state variables at that time step. As mentioned before, for continuous observations one simple form for this dependence is linear Gaussian; that is, the observation $Y_t$ is a Gaussian random vector whose mean is a linear function of the state variables. We represent the state variables as $K \times 1$ vectors, where each of the $K$ discrete values corresponds to a 1 in one position and 0 elsewhere. The resulting probability density for a $D \times 1$ observation vector $Y_t$ is

$$\begin{eqnarray*} &&P(Y_t| X_t)= |C|^{-1/2}(2\pi)^{-D/2}\exp \left\{-\frac{1}{2}(Y_t - \mu_t)^{\rm T}C^{-1} (Y_t - \mu_t)\right\},\\ \mbox{where} && \mu_t = \sum_{m=1}^{M}W^{(m)}X_t^{(m)} \end{eqnarray*}$$
Each $W^{(m)}$ is a $D\times K$ matrix whose columns are the contributions to the mean for each of the $K$ settings of $X^{(m)}_t$, and $C$ is the $D \times D$ covariance matrix.
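Concretely, the generative process can be sketched in a few lines of NumPy. All dimensions and parameters below ($M$, $K$, $D$, $T$, the transition matrices, the $W^{(m)}$ and $C$) are hypothetical stand-ins, chosen only to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: M chains, K states per chain, D-dim observations, T steps.
M, K, D, T = 3, 2, 4, 50

# Per-chain K x K transition matrices, P[m][j, k] = P(state j -> state k).
P = [rng.dirichlet(np.ones(K), size=K) for _ in range(M)]
# Per-chain D x K emission weights and a shared D x D covariance C.
W = [rng.normal(size=(D, K)) for _ in range(M)]
C = np.eye(D) * 0.1

def one_hot(k, K):
    v = np.zeros(K)
    v[k] = 1.0
    return v

# Sample the M independent Markov chains...
states = [rng.integers(K, size=M)]                  # X_1: random initial states
for t in range(1, T):
    states.append(np.array([rng.choice(K, p=P[m][states[-1][m]])
                            for m in range(M)]))
# ...then emit Y_t ~ N(sum_m W^(m) X_t^(m), C).
Y = np.array([rng.multivariate_normal(
        sum(W[m] @ one_hot(s[m], K) for m in range(M)), C)
      for s in states])
print(Y.shape)  # (T, D)
```

The inference methods in the rest of the note work backwards from `Y` to beliefs about the hidden chains.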

The joint log likelihood for the sequence of states and observations hence can be represented as

\begin{eqnarray*}
&& \log P(\left\{X_t, Y_t\right\}) = -\frac{1}{2}\sum_{t=1}^{T}\log P(Y_t|X_t) + \sum_{t=1}^{T}\sum_{m=1}^{M}\log P(X^{(m)}_t|X^{(m)}_{t-1}) -\frac{T}{2}\log|C| -\frac{d\cdot T}{2}\log(2\pi),\\
\mbox{where} && \log P(Y_t|X_t) = Y^{\rm T}_tC^{-1}Y_t - 2\sum_{m=1}^{M}Y^{\rm T}_tC^{-1}W^{(m)}X^{(m)}_t + \sum_{m=1}^{M}\sum_{n=1}^{M}{\rm tr}(W^{(m){\rm T}}C^{-1}W^{(n)}X^{(n)}_tX^{(m){\rm T}}_t),\\
\mbox{and} && \log P(X^{(m)}_t|X^{(m)}_{t-1}) = {\rm tr}(\log P^{(m)}\, X^{(m)}_tX^{(m){\rm T}}_{t-1})
\end{eqnarray*}

where $tr(J)$ represents the trace of matrix J.

#### Trace Basics

Let $A$ be an $n \times n$ matrix. Then its trace is given by $Tr(A) = \sum_i A_{ii}$.

Some properties/lemmas of the trace:

• $Tr(A+B) = Tr(A) + Tr(B)$
• $Tr(cA) = c\,Tr(A)$
• $Tr(A^{T}) = Tr(A)$ — the diagonal elements are the same in $A^{T}$
• $Tr(AB) = Tr(BA)$

$$\begin{eqnarray*} \mbox{Tr}(AB) & = & a_{11}b_{11}+a_{12}b_{21}+\cdots+a_{1n}b_{n1}\\ & & +a_{21}b_{12}+a_{22}b_{22}+\cdots+a_{2n}b_{n2}\\ & & +\cdots\\ & & +a_{n1}b_{1n}+a_{n2}b_{2n}+\cdots+a_{nn}b_{nn}\\ \mbox{Tr}(BA) & = & b_{11}a_{11}+b_{12}a_{21}+\cdots+b_{1n}a_{n1}\\ & & +\cdots\\ & & +b_{n1}a_{1n}+b_{n2}a_{2n}+\cdots+b_{nn}a_{nn} \end{eqnarray*}$$

Hence, $Tr(AB) = Tr(BA)$

• $Tr(X^{T}Y) = Tr((X^{T}Y)^{T}) = Tr(Y^{T}X)$ —since $Tr(A^{T}) = Tr(A)$
• $Tr(XY^{T}) = Tr((XY^{T})^{T}) = Tr(YX^{T})$
• $Tr(X^{T}Y) = Tr(YX^{T}) = Tr(Y^{T}X) = Tr(XY^{T}) = \sum_{ij}X_{ij}Y_{ij}$, which is the dot product of two matrices.
• Jacobi’s formula for the derivative of the determinant: $\frac{\partial |A|}{\partial A} = |A|\,(A^{-1})^{T}$, so that $\frac{\partial \log|A|}{\partial A} = (A^{-1})^{T}$
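These identities are easy to verify numerically; a quick NumPy check with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
X, Y = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))

# Tr(AB) = Tr(BA): the cyclic property.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# Tr(A^T) = Tr(A): transposing preserves the diagonal.
assert np.isclose(np.trace(A.T), np.trace(A))
# Tr(X^T Y) = sum_ij X_ij Y_ij: the matrix dot product.
assert np.isclose(np.trace(X.T @ Y), np.sum(X * Y))
print("all trace identities hold")
```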

### Exact Inference of Factorial Hidden Markov Models

Exact inference in a factorial HMM is intractable, especially with a large number of chains, where the number of joint states explodes. If we do not observe $Y$, the chains of $X$ evolve independently of each other. However, once we make an observation, the chains become coupled. This can be explained by Berkson's paradox: if we restrict ourselves to the cases where at least one of the events A or B occurs, the knowledge that B has occurred makes it less likely that A has occurred, or
$P(A \mid B, A\ \mbox{or}\ B) < P(A \mid A\ \mbox{or}\ B)$ where $0 < P(A), P(B) < 1$ (the events must be interesting enough that neither occurs with certainty or never).

Mathematically, it follows that even though $X^{(m)}$ and $X^{(n)}$ for $m\neq n$ are a priori independent, $X^{(m)}_t$ and $X^{(n)}_t$ become dependent given an observation $Y_t$ that depends on both.
Hence exact inference involves enumerating all joint states $(X^{(1)},X^{(2)},\ldots,X^{(M)})$, which becomes computationally intractable when M is large.
Suppose a university admission committee recruits brainy or sporty candidates. A student can be brainy, sporty, both, or neither, and the two traits are independent in the population. Since the college normally recruits students who are brainy or sporty, it follows that, given that an admitted student is brainy, he is less likely to be sporty. This illustrates that although the chains $X$ in a factorial hidden Markov model are uncorrelated (i.e., they evolve independently), once we condition on an observation over all the states, the chains become correlated. Hence we have to estimate the joint state of all of these chains together, which leads to an exploding number of states: if each chain can take 2 states and we have 50 chains, the resulting number of joint states is $2^{50}$.

### Naive Mean Field Approximation

From the exact inference discussion (and Berkson's paradox) on the factorial hidden Markov model, if we do not observe $Y_t$ (at time slice t), then the latent states (chains at t) $X_t^{(1)}, X_t^{(2)}, \ldots, X_t^{(M)}$ evolve independently of each other. But if $Y_t$ is available, then the chains are correlated, and hence the number of states explodes. As an example, if a state can take up to K values and M chains are present, then the factorial HMM requires $K^M$ joint states (a huge space!) once the observation is available. So, when $Y_t$ is available, for two latent states $X_t^{(m)}$, $X_t^{(n)}$ we have $P(X_t^{(m)}, X_t^{(n)}|Y_t) \neq P(X_t^{(m)}|Y_t) \cdot P(X_t^{(n)}|Y_t)$. Since we cannot cope with this joint probability distribution, we need a simpler distribution.

The naive mean field approximation is given as a distribution Q which approximates the actual distribution of the $X_t^{(m)}$'s with parameters $\theta_t^{(m)}$ for each of the chains m. We can express this approximation as $Q(\{X_t\}|\theta) = \prod_t \prod_m Q(X_t^{(m)}|\theta_t^{(m)})$. In other words, we assume that under $Q$ the distribution of $X_t^{(m)}$, with parameter $\theta_t^{(m)}$, is independent of the other $X_t^{(n)}$, $m \neq n$, and that for each chain m, $X_t^{(m)}$ is independent of $Y_t$. By this assumption, we get rid of all the dependencies in the factorial HMM (i) between the chains and (ii) between states and observation. Thus, the complexity of the model is greatly reduced.

This distribution Q belongs to the exponential family. Given that a state can take up to $K$ values, for a chain m this approximate distribution can be written as $Q(X_t^{(m)}|\theta) = \prod_k (\theta_{t,k}^{(m)})^{X_{t,k}^{(m)}} = \exp(X_t^{(m)T} \log\theta_t^{(m)})$, where $\sum_k \theta_{t,k}^{(m)} = 1$.
Here, $X_t^{(m)}$ is a unit vector $(X_{t,1}^{(m)}, X_{t,2}^{(m)}, \ldots, X_{t,K}^{(m)})$ in which exactly one entry is 1 and the rest are zero, since at time t the chain $X_t^{(m)}$ can be in only one of the K states. Thus, in the end, the exponent picks out a single $\log\theta_{t,k}^{(m)}$ value.

The ultimate goal is to determine the parameters $\theta_t^{(m)}$ for each chain m at time t by minimizing the KL divergence. For computing this KL divergence, some statistics of the distribution Q are defined as follows:

• Expected value: $E_Q X_t^{(m)} = \theta_t^{(m)}$
• Cross moment between consecutive states of the same chain: $E_Q X_{t-1}^{(m)} X_{t}^{(m)T} = \theta_{t-1}^{(m)}\theta_t^{(m)T}$
• Cross moment between chains at the same time:
• $E_Q X_t^{(m)} X_t^{(n)T} = \theta_t^{(m)} \theta_t^{(n)T}$ — for $m \neq n$
• $E_Q X_t^{(m)} X_t^{(m)T} = diag(\theta_t^{(m)})$ — for $m=n$

(where diag() is an operator that takes a vector and returns a square matrix with the elements of the vector along its diagonal, and zeros everywhere else.)

Now, recall that the log likelihood of the actual distribution P is given as:

\begin{eqnarray*}
\log P(\{X_t,Y_t\}) &=& -\frac{1}{2} \sum_{t=1}^T \Big(Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{(m)} X_t^{(m)}\\
&&+\sum_{m=1}^{M}\sum_{n=1}^{M}{\rm tr}(W^{(m)T}C^{-1}W^{(n)}X_t^{(n)}X_t^{(m)T})\Big)\\
&&+\sum_{t=1}^{T} \sum_{m=1}^{M} {\rm tr}(\log P^{(m)}\, X_t^{(m)} X_{t-1}^{(m)T})-\frac{T}{2} \log |C|-\frac{d\cdot T}{2} \log (2\pi)
\end{eqnarray*}

The KL divergence is the amount of information lost by expressing the actual distribution P with the approximate distribution Q. It can be defined as $KL(Q||P) = E_Q \log (Q/P)= E_Q \log Q -E_Q \log P$. From the definition of $Q(X_t^{(m)}|\theta)$, $E_Q \log Q = \sum_t \sum_m \theta_t^{(m)T} \log \theta_t^{(m)}$.

Using the expectation and cross-moment statistics of Q listed above, we can write $E_Q \log P$ as:

$$\begin{eqnarray*} E_Q \log P &=& -\frac{1}{2} \sum_{t=1}^T \Big(Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{(m)} \theta_t^{(m)} +\sum_{m\neq n} {\rm tr}(W^{(m)T}C^{-1}W^{(n)}\theta_t^{(n)}\theta_t^{(m)T})\\ && +\sum_{m=1}^{M} {\rm tr}(W^{(m)T}C^{-1}W^{(m)}\,{\rm diag}(\theta_t^{(m)}))\Big)\\ && +\sum_{t=1}^{T} \sum_{m=1}^{M} {\rm tr}(\log P^{(m)}\, \theta_t^{(m)} \theta_{t-1}^{(m)T})-\frac{T}{2} \log |C|-\frac{d\cdot T}{2} \log (2\pi) \end{eqnarray*}$$

Here we split the quadratic term in $E_Q \log P$, $\sum_{m=1}^{M}\sum_{n=1}^{M}{\rm tr}(W^{(m)T}C^{-1}W^{(n)}X_t^{(n)}X_t^{(m)T})$, into its $m \neq n$ and $m = n$ parts according to the cross-moment statistics of the distribution Q.

Thus we have,

\begin{eqnarray*}
KL(Q||P) &=& \sum_t \sum_m \theta_t^{(m)T} \log \theta_t^{(m)} +\frac{1}{2} \sum_{t=1}^T \Big(Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{(m)} \theta_t^{(m)}\\
&& +\sum_{m\ne n}{\rm tr}(W^{(m)T}C^{-1}W^{(n)}\theta_{t}^{(n)}\theta_{t}^{(m)T})+\sum_{m=1}^{M}{\rm tr}(W^{(m)T}C^{-1}W^{(m)}{\rm diag}(\theta^{(m)}_{t}))\Big)\\
&& -\sum_{t=1}^{T} \sum_{m=1}^{M} {\rm tr}(\log P^{(m)}\, \theta_t^{(m)} \theta_{t-1}^{(m)T})+\frac{T}{2} \log |C|+\frac{d\cdot T}{2} \log (2\pi)
\end{eqnarray*}

Differentiating the above KL divergence w.r.t. $\theta_t^{(m)}$ and equating to 0 we get
$\log \theta_t^{(m)}-W^{(m)T}C^{-1}Y_t +\sum_{n\neq m} W^{(m)T}C^{-1}W^{(n)}\theta_t^{(n)} +\frac{1}{2}{\rm diag}(W^{(m)T}C^{-1}W^{(m)})-\log P^{(m)}\theta_{t+1}^{(m)}-\log P^{(m)T}\theta_{t-1}^{(m)} +\mbox{const} = 0$

Hence,

$\theta_t^{(m)} \propto \exp (W^{(m)T}C^{-1}Y_t -\sum_{n\neq m} W^{(m)T}C^{-1}W^{(n)}\theta_t^{(n)} -\frac{1}{2}{\rm diag}(W^{(m)T}C^{-1}W^{(m)})+\log P^{(m)}\theta_{t+1}^{(m)}+\log P^{(m)T}\theta_{t-1}^{(m)})$

Then we use a fixed-point algorithm to get the values of $\theta$: we iterate the updates of the $\theta_t^{(m)}$ until they converge.
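A minimal NumPy sketch of this fixed-point iteration. The dimensions, parameters, and observations below are random stand-ins (in a real run the $W^{(m)}$, $C$, $P^{(m)}$ come from learning and $Y$ from data), so this only illustrates the shape of the update:

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, D, T = 3, 2, 4, 20

W = [rng.normal(size=(D, K)) for _ in range(M)]     # emission weights (stand-ins)
C_inv = np.eye(D) * 10.0                            # C = 0.1 * I
P = [rng.dirichlet(np.ones(K), size=K) for _ in range(M)]
Y = rng.normal(size=(T, D))                         # stand-in observations

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# theta[m][t] approximates P(X_t^(m) | Y); initialize uniformly.
theta = [np.full((T, K), 1.0 / K) for _ in range(M)]

for _ in range(50):                                 # fixed-point sweeps
    for m in range(M):
        for t in range(T):
            # Residual observation: subtract the mean contributions of the
            # other chains, then score each state of chain m against it.
            resid = Y[t] - sum(W[n] @ theta[n][t] for n in range(M) if n != m)
            v = W[m].T @ C_inv @ resid
            v -= 0.5 * np.diag(W[m].T @ C_inv @ W[m])
            if t > 0:                               # message from the past
                v += np.log(P[m]).T @ theta[m][t - 1]
            if t < T - 1:                           # message from the future
                v += np.log(P[m]) @ theta[m][t + 1]
            theta[m][t] = softmax(v)

print(theta[0][0])  # approximate marginal of chain 0 at t = 0
```

Each sweep recomputes $\theta_t^{(m)}$ from the residual observation plus the neighboring time slices, mirroring the terms of the update above; the exponent is normalized with a softmax so each $\theta_t^{(m)}$ stays a distribution.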

### Structured Approximation

The mean field approximation presented in the previous section factors the posterior probability so that all the state variables are statistically independent. In contrast to this rather extreme factorization, there exists a more realistic approximation that is both tractable and preserves much of the probabilistic structure of the original system. In this scheme the factorial HMM is approximated by M uncoupled HMMs, as shown in Figure 2.

#### Maximizing Log-Likelihood of Variational Distribution

As described in previous chapters, efficient and exact inference can be implemented using the forward–backward algorithm within each HMM. Note that the mean field approximation of the previous section discards even the Markov structure within each chain. Hence, a structured approximation should be employed, using a more structured variational distribution $Q$ that still provides a tractable lower bound on the log likelihood for the learning algorithm. The only dependency retained in this method is the dependency of $X_t^{(m)}$ on $X_{t-1}^{(m)}$; the remaining random variables are assumed to be independent. Therefore, the structured variational approximation is defined as follows:

$Q(\{X_{t}\}|\theta)=\frac{1}{Z_{Q}}\prod_{m=1}^{M}\prod_{t=1}^{T}Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta),$

where $Z_Q$ is a normalization constant guaranteeing that $Q$ is a valid probability distribution. The probability of $X_t^{(m)}$ given the parameter $\theta$ and $X_{t-1}^{(m)}$ is defined as:

$Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)=\prod_{k=1}^{K}\left(h_{t,k}^{(m)}\sum_{j=1}^{K}X_{t-1,j}^{(m)}\cdot P_{j,k}^{(m)}\right)^{X_{t,k}^{(m)}}=\exp\left(\left(\log\vec{h}_{t}^{(m)}+\vec{X}_{t-1}^{(m)}\cdot\log P^{(m)}\right)^{\mbox{T}}\cdot\vec{X}_{t}^{(m)}\right),$

where $\vec{h}_{t}^{(m)} = \{h_{t,1}^{(m)}, h_{t,2}^{(m)},\ldots, h_{t,K}^{(m)}\}$ is the vector of variational parameters corresponding to $\vec{X}_{t}^{(m)}$ (note that $h_{t,k}^{(m)}$ is the parameter related to $X_{t,k}^{(m)}$, recalling that $X_t^{(m)}$ can take on $K$ values). In addition, $P_{j,k}^{(m)}$ is the transition probability of $X_t^{(m)}$ from state $j$ to state $k$. It is easy to show that $Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)$ follows an exponential family distribution (see \cite{bishop2006pattern}, chapter 2).
To characterize $Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)$, let's define the following moments (the expected values of $X_{t}^{(m)}$, $X_{t-1}^{(m)}X_{t}^{(m){\rm T}}$ and $X_{t}^{(m)}X_{t}^{(n){\rm T}}$):

• $\mathbf{E}_{Q}X_{t}^{(m)}=\gamma_{t}^{(m)}$
• $\mathbf{E}_{Q}X_{t-1}^{(m)}X_{t}^{(m){\rm T}}=\xi_{t-1}^{(m)}$
• $\mathbf{E}_{Q}X_{t}^{(m)}X_{t}^{(n){\rm T}}=\begin{cases} \gamma_{t}^{(m)}\gamma_{t}^{(n)\mbox{T}} & m\ne n\\ \mbox{diag}\left(\gamma_{t}^{(m)}\right) & m=n \end{cases}$

By factorizing $Q(\{X_{t}\}|\theta)$, one can rewrite the negative log likelihood of the variational distribution as follows:
$-\log Q(\{X_{t}\}|\theta)=\log Z_{Q}-\sum_{t}\sum_{m}\log\vec{h}_{t}^{(m)\mbox{T}}\cdot\vec{X}_{t}^{(m)}-\sum_{t}\sum_{m}\vec{X}_{t-1}^{(m)\mbox{T}}\cdot\log P^{(m)}\cdot\vec{X}_{t}^{(m)}$

Note that maximizing the log-likelihood function is equivalent to minimizing $-1\times$ the log-likelihood function.

#### Minimizing KL-Divergence of Variational Distribution

Maximizing the log-likelihood function is equivalent to minimizing the KL-divergence. Let $P=P(\{X_{t}\}|\theta)$ and $Q=Q(\{X_{t}\}|\theta)$ represent the true and variational distributions of $\{X_{t}\}$. The KL-divergence is defined as follows:
\begin{eqnarray}
&&D_{KL}(Q\|P)=\mathbf{E}_{Q}\log Q-\mathbf{E}_{Q}\log P=\sum_{t}\sum_{m}\log\vec{h}_{t}^{(m)\mbox{T}}\cdot\gamma_{t}^{(m)}\\
&&+\frac{1}{2}\sum_{t=1}^{T}\left(Y_{t}^{\mbox{T}}C^{-1}Y_{t}-2\sum_{m=1}^{M}Y_{t}^{\mbox{T}}C^{-1}W^{(m)}\gamma_{t}^{(m)}+\sum_{m\ne n}\mbox{tr}\left(W^{(m)\mbox{T}}C^{-1}W^{(n)}\gamma_{t}^{(n)}\gamma_{t}^{(m)\mbox{T}}\right)+\sum_{m=1}^{M}\mbox{tr}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\mbox{diag}\left(\gamma_{t}^{(m)}\right)\right)\right)\\
&&-\log Z_Q+{T\over 2}\log |C| + {d\cdot T\over 2}\log (2\pi)
\end{eqnarray}
In order to minimize the KL-divergence, one needs to take the derivative with respect to the parameters $h_{t}^{(n)}$. This can also be done within the EM algorithm. In this case, it is more convenient to take the derivative with respect to $\log h_{t}^{(n)}$ (note that the minimizer of the KL-divergence remains the same, which can be verified using the chain rule).
\begin{eqnarray}
&&\frac{\partial D_{KL}(Q\|P)}{\partial\log h_{\tau}^{(n)}}=\gamma_{\tau}^{(n)}+\sum_{t=1}^{T}\sum_{m=1}^{M}\underbrace{\left(\log h_{t}^{(m)}-W^{(m)\mbox{T}}C^{-1}Y_{t}+\sum_{l\ne m}W^{(m)\mbox{T}}C^{-1}W^{(l)}\gamma_{t}^{(l)}+\frac{1}{2}\mbox{diag}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\right)\right)}_{\frac{\partial}{\partial\gamma_{t}^{(m)}}}\frac{\partial\gamma_{t}^{(m)}}{\partial\log h_{\tau}^{(n)}}\\
\end{eqnarray}
Solving the resulting fixed point equations gives $h_{t}^{(m)}$ in exponential family form.
\begin{eqnarray}
&&\Rightarrow h_{t}^{(m)}\propto\exp\left(W^{(m)\mbox{T}}C^{-1}Y_{t}-\sum_{l\ne m}W^{(m)\mbox{T}}C^{-1}W^{(l)}\gamma_{t}^{(l)}-\frac{1}{2}\mbox{diag}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\right)\right)
\end{eqnarray}
Indeed, the parameter $h_{t}^{(m)}$ obtained from these fixed point equations plays the role of the observation probability associated with state variable $X_{t}^{(m)}$ in hidden Markov model $m$. Using these probabilities, the forward–backward algorithm computes a new set of expectations for $X_{t}^{(m)}$.
The next step is maximization: after finding the expected values of the latent factors given the parameters, we ask, given the observations, what values of the parameters maximize the expected likelihood. How can one estimate the parameters? One uses a maximum likelihood or maximum expected likelihood approach. In this case there are two sets of parameters, $W$ and $C^{-1}$ (note that both are matrices), along with the transition matrices $P^{(m)}$. As mentioned before, $W$ is used to calculate the expected value of $Y_t|X_t$. Taking the derivative with respect to the matrix $W^{(m)}$ and setting it equal to zero gives:
\begin{eqnarray}
&&\frac{\partial \log P(\{X_{t},Y_{t}\})}{\partial W^{(m)}}=\sum_{t}\left(C^{-1}Y_{t}X_{t}^{(m)\mbox{T}}-\sum_{n}C^{-1}W^{(n)}X_{t}^{(n)}X_{t}^{(m)\mbox{T}}\right)\stackrel{\mbox{set}}{=}0\\&\Rightarrow&
W=\left(\sum_{t}Y_{t}X_{t}^{\mbox{T}}\right)\left(\sum_{t}X_{t}X_{t}^{\mbox{T}}\right)^{-1},\quad W=\left(W^{(1)},\ldots,W^{(M)}\right)\\
&&\mbox{where }X_{t}=\left(\begin{array}{c}
X_{t}^{(1)}\\
\vdots\\
X_{t}^{(M)}
\end{array}\right)
\end{eqnarray}

Similarly, the transition matrix of each chain is estimated as follows:
\begin{eqnarray}
&\frac{\partial \log P(\{X_{t},Y_{t}\})}{\partial P^{(m)}}\propto\sum_{t=1}^{T}X_{t-1}^{(m)}X_{t}^{(m)\mbox{T}}\\\Rightarrow&P_{i,j}^{(m)}=\frac{\sum_{t}X_{t-1,i}^{(m)}X_{t,j}^{(m)}}{\sum_{t}X_{t-1,i}^{(m)}}
\end{eqnarray}
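The two M-step formulas can be sketched as follows. The expected statistics `EX` are random stand-ins for what the E-step would produce, and the cross moment $\langle X_{t-1}X_t^{\rm T}\rangle$ is approximated by the outer product of the means purely to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, D, T = 2, 2, 3, 100

# Pretend E-step output: expected one-hot states <X_t^(m)> for each chain
# (random stand-ins; in practice these come from forward-backward inference).
EX = rng.dirichlet(np.ones(K), size=(M, T))         # EX[m, t] sums to 1
Y = rng.normal(size=(T, D))                         # stand-in observations

# Stack the chains: X_t is the MK-vector (X_t^(1); ...; X_t^(M)).
Xs = EX.transpose(1, 0, 2).reshape(T, M * K)

# W = (sum_t Y_t <X_t>^T)(sum_t <X_t X_t^T>)^(-1); a pseudo-inverse is used
# for numerical stability with these stand-in statistics.
W = (Y.T @ Xs) @ np.linalg.pinv(Xs.T @ Xs)
print(W.shape)  # (D, M*K)

# P_ij^(m) = sum_t <X_{t-1,i} X_{t,j}> / sum_t <X_{t-1,i}>, here with the
# factored stand-in <X_{t-1} X_t^T> ~ <X_{t-1}><X_t>^T.
Pm = []
for m in range(M):
    num = sum(np.outer(EX[m, t - 1], EX[m, t]) for t in range(1, T))
    Pm.append(num / num.sum(axis=1, keepdims=True))  # normalize each row
print(Pm[0].sum(axis=1))  # rows sum to 1
```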

# Expectation Propagation [10/01]

Prepared by Ashish Gupta, Utsav Patel and Santhosh Somarapu.
Wen Dong will add the math of the lecture into this note or will turn the math into a problem in problem set 3/4.

### Belief Propagation

Belief propagation, also known as sum-product message passing, is a message-passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution of each unobserved node, conditional on any observed nodes.

Suppose we have the graphical model shown above. We can convert this graphical model into a cluster graph / component graph / clique tree, shown below, by removing the directed edges, as edge directions don't matter when we apply the forward-backward algorithm for inference. The different components are shown in the diagram below.

(Cluster labels: A, B, C, D, in order.)

A more simplified version of the component graph is shown below. Each cluster has jurisdiction over some variables. For example, cluster A has jurisdiction over variable x1, cluster B has jurisdiction over variables x1 and x2, cluster C has jurisdiction over variables x2 and x3, and so on. These components talk to each other and try to convince each other about the states of the variables over which they share jurisdiction. So cluster A talks to cluster B about the state of the variable x1, making cluster B better informed about x1 in order to get a better distribution over cluster B's variables.

All these clusters talk to each other via message passing. Each cluster multiplies all its incoming messages by its current belief and sends the result to the next cluster as its outgoing message, and this continues until convergence, i.e., until the variable distributions stop changing appreciably.
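This scheme can be sketched as a sum-product pass on a small chain. The potentials `psi` and `phi` below are invented numbers standing in for the clusters in the figure:

```python
import numpy as np

# A minimal sum-product pass on a chain x1 - x2 - x3 of binary variables
# (hypothetical potentials, not from the lecture).
psi = [np.array([[1.0, 0.5], [0.5, 1.0]])] * 2   # pairwise potentials
phi = [np.array([0.7, 0.3]),                     # unary potentials (soft
       np.array([0.5, 0.5]),                     # evidence on each
       np.array([0.2, 0.8])]                     # variable)

n = len(phi)
fwd = [None] * n      # messages passed left-to-right
bwd = [None] * n      # messages passed right-to-left
fwd[0] = np.ones(2)
bwd[-1] = np.ones(2)
for i in range(1, n):
    # Incoming message times current belief, summed over the sender's variable.
    fwd[i] = psi[i - 1].T @ (fwd[i - 1] * phi[i - 1])
for i in range(n - 2, -1, -1):
    bwd[i] = psi[i] @ (bwd[i + 1] * phi[i + 1])

# Each marginal combines the unary term with both incoming messages.
marginals = [fwd[i] * phi[i] * bwd[i] for i in range(n)]
marginals = [m / m.sum() for m in marginals]
print(marginals[1])
```

Because the graph is a tree, one forward and one backward pass suffice; on graphs with loops the same updates are iterated (loopy belief propagation) without a convergence guarantee.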

### Approximate Belief Propagation

Let $\alpha$ denote the forward messages and $\beta$ denote the backward messages, and let both $\alpha$ and $\beta$ belong to exponential families. Given the sufficient statistics, which are the expected values of the features under the distribution, we can find the parameters of $\alpha$; given the parameters, we can find the expected values of the features. There is a nice duality between the sufficient statistics and the parameters in the exponential family.

An exponential family is defined by a set of features. Assuming an exponential family form for $\alpha_{t-1}$, we can propagate the information forward to $\alpha_t$, estimate the features of the resulting distribution, and, using the duality between features and parameters, estimate its parameters. Similarly, we can propagate backward from $\beta_t$ to $\beta_{t-1}$ and estimate the features and the parameters.

### Stochastic Kinetic Models

Stochastic kinetic models (SKMs) are highly multivariate jump processes used to model chemical reaction networks, particularly in biochemical and cellular systems.

An example of such a system is a mixture of molecules of different chemical species (reactants) that interact and produce different chemical products. Many possible products can be generated from all the processes, and we want to know the possible composition of the mixture after some time.

• If we have a combination of M chemicals interacting in a system, we can specify the initial combination of the chemicals (at time t=0) as
$x_0 = \{x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(M)}\}$
where $x_0^{(m)}$ is the initial quantity of the $m^{th}$ chemical.

• At time t, the combination is $x_t = \{x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(M)}\}$

• Among these chemicals, there can be many possible reactions that occur and change the state of the chemical composition.

E.g.

$A + C + B \rightarrow D + E$
$C+ E\rightarrow B+A$
$A + E \rightarrow F$

• We can assume that there are V reactions in total, and that each reaction's rate depends on the quantities of its reactant chemicals at time t ($x_t$) and a reaction constant $c_k$.

If reaction k has just one reactant, species i, the rate is $h_k(x_t, c_k) \propto c_k \cdot x_t^{(i)}$.

If there is more than one reactant, the rate is proportional to $c_k$ times the product of the quantities of the reactant species: $h_k(x_t, c_k) \propto c_k \prod_{i \in \mathrm{reactants}(k)} x_t^{(i)}$.

Each reaction k also changes the composition by $\Delta_k$. If reaction k converts $\alpha^{(1)}_k X^{(1)}+ \cdots +\alpha^{(M)}_k X^{(M)} \rightarrow \beta^{(1)}_k X^{(1)} + \cdots +\beta^{(M)}_k X^{(M)}$
• then $\Delta_k = (\beta^{(1)}_k - \alpha^{(1)}_k, \ldots ,\beta^{(M)}_k - \alpha^{(M)}_k)$
• So we want to estimate the state $x_t$ of the chemicals probabilistically using stochastic simulation.

### Gillespie’s Algorithm Implementation
• Given the initial state of the composition $x_0 = \{x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(M)}\}$, evaluate the propensity function (rate) $h_k(x_t, c_k)$ for every reaction k that can occur, and let $h_0 = \sum_k h_k$.
• Sample the time $\tau$ until the next reaction from an exponential distribution with rate $h_0$, and choose which reaction occurs randomly according to the rate weights ($\frac{h_k}{h_0}$).
• Update the time and the available composition of the mixture according to the consumed reactants and generated products: $t\leftarrow t+\tau$, $x\leftarrow x+\Delta_k$.
• Repeat steps 2 and 3 as long as reactions can still occur from the available composition, or until you arrive at the desired time to observe the system.
• Repeating the whole simulation several times gives additional information about how the system behaves in intermediate states and which composition states occur frequently.
• In this manner we can simulate the whole chemical chain reaction process as a stochastic model.
• The same algorithm can be applied to many other examples: for instance, given an initial state of predators and prey in an ecological system, what are the possible numbers of prey and predators that will thrive or survive in the system after some time?
• Mathematically, the probability that the next event is reaction k and that it occurs after a time interval $\tau$ is given by $P(k, \tau \mid x) = h_k(x, c_k) \exp(-h_0(x)\tau)$, where $h_0(x) = \sum_k h_k(x, c_k)$.
• Hence the joint probability of the system over the whole time period is
$= \prod_{i=0}^n \Big[ h_{v_i}(x_{t_i}, c_{v_i}) \exp(-h_0(x_{t_i}, c)(t_{i+1} - t_i))\Big]$
$= \Big[\prod_{i=0}^n h_{v_i}(x_{t_i}, c_{v_i})\Big] \exp\Big(-\sum_{i=0}^n h_0(x_{t_i}, c)(t_{i+1} - t_i)\Big)$
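The steps above can be sketched directly in Python for a hypothetical two-reaction system (the species, rate constants, and stoichiometry are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# A toy system (hypothetical): A + B -> C with rate c0*A*B, and C -> A + B
# with rate c1*C. State x = (A, B, C); Delta rows are products - reactants.
c = np.array([0.01, 0.1])
Delta = np.array([[-1, -1, 1],
                  [1, 1, -1]])

def rates(x):
    return np.array([c[0] * x[0] * x[1], c[1] * x[2]])

def gillespie(x0, t_end):
    x, t = np.array(x0, dtype=float), 0.0
    while True:
        h = rates(x)
        h0 = h.sum()
        if h0 == 0:                        # no reaction can fire any more
            return x
        t += rng.exponential(1.0 / h0)     # time to next reaction ~ Exp(h0)
        if t > t_end:
            return x
        k = rng.choice(len(h), p=h / h0)   # pick reaction with weight h_k/h0
        x += Delta[k]                      # apply its net change
    return x

print(gillespie([100, 100, 0], t_end=10.0))
```

Each call returns one sampled trajectory endpoint; repeating the call approximates the distribution over compositions at `t_end`.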

### Stochastic Kinetic Model — Discretization
In order to make inferences from observations of a system, and to combine the observations with the dynamics, we discretize the events, as is commonly done in solving differential equations. Instead of sampling the time to the next event, we increment time step by step, one time unit at a time.
To do this, we add a "null event," which means nothing has changed, and at each step we sample whether a specific event happens or the null event happens. These are the two changes we make in order to discretize the system.
We have applied the same idea previously: instead of estimating the time to the next success or failure (or the next weather change) directly, we proceed step by step, saying that at time 1 nothing has changed, at time 2 nothing has changed, and so on, until at some time t the weather changes. It is the same idea; we can do it either way.
This gives a discrete-time, discrete-event model. "Discrete event" means that over any finite amount of time there are only a finite number of events whose effect on the system is greater than some threshold. In general there are two types of dynamics: one in which any small interval contains an infinite number of (infinitesimal) changes, and the discrete-event type; by counting only the events greater than some threshold, we always obtain a finite number of events over any finite amount of time.
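The discretized scheme can be sketched the same way: at each unit step, either one reaction fires, with probability roughly $h_k(x)\,dt$, or a null event occurs and nothing changes. The two-reaction system below is hypothetical, and $dt$ must be small enough that the probabilities remain valid:

```python
import numpy as np

rng = np.random.default_rng(5)

# Discrete-time version: at each step of size dt, event k fires with
# probability ~ h_k(x) * dt; with the remaining probability a "null event"
# occurs and the state is unchanged.
def simulate_discrete(x0, h, Delta, dt, n_steps):
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        hk = h(x)
        p = np.append(hk * dt, 1.0 - hk.sum() * dt)   # last entry = null event
        k = rng.choice(len(p), p=p)
        if k < len(hk):                               # a real event fired
            x += Delta[k]
    return x

# Toy reactions (hypothetical): A + B -> C and C -> A + B.
c = np.array([0.01, 0.1])
Delta = np.array([[-1, -1, 1], [1, 1, -1]])
h = lambda x: np.array([c[0] * x[0] * x[1], c[1] * x[2]])
print(simulate_discrete([100, 100, 0], h, Delta, dt=0.001, n_steps=10_000))
```

Because every step has a fixed duration, this version aligns naturally with observations made on a regular time grid, which is the point of the discretization.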

### Projection — Moment Matching

Here we are coping with approximate message passing. In the forward (or backward) step, we first have some forward message $\alpha$, and we propagate this message through the system, i.e., through the potential $\psi$.

After we propagate through $\psi$, we find the sufficient statistics and use them to write the new message in exponential family form, and so on. Enumerating all combinatorial states is implausible, so instead we do the following. The system looks like this, with three time slices:
t-1, t, t+1, and chains X1, X2, X3, and so on.

Once we have sampled an event, we change the states of the latent variables. To change the states of the different variables, we need to know the event and the previous values of the latent variables; given these, we know exactly how the state changes. In terms of inference, however, this would require enumerating all joint states, which is not possible. What we actually do is find the mean effect, which is what we want.

Here, for example, I know that I am currently susceptible, and I have been in contact with 2 random people. What is the probability that I will be infected? If I am infected, my state changes according to this event. Working with the exact events this way requires enumerating the combinatorial states.

Instead, in this system we ask: given that I am susceptible, based on my average contact with the other persons, what is my probability of getting infected? This probability does not depend on the specific values of the events or on which specific person I am in contact with, but on, for example, the average amount of time I am in contact with each person. In this way, we decouple the relations between different events. As a result, we can make inferences in a way that is computationally tractable.
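A toy numerical version of this mean-field idea (all numbers hypothetical): rather than conditioning on exactly which contacts are infectious, we use each contact's marginal probability of being infectious and the average contact time:

```python
import numpy as np

beta = 0.3                             # infection rate per unit contact time
p_infectious = np.array([0.1, 0.6])    # marginal beliefs about my 2 contacts
contact_time = np.array([1.0, 0.5])    # average contact durations

# Under the mean-field independence assumption, the probability of escaping
# infection is the product of per-contact escape probabilities.
p_escape = np.prod(np.exp(-beta * p_infectious * contact_time))
p_infected = 1.0 - p_escape
print(round(p_infected, 4))
```

The update touches each contact once instead of enumerating all joint configurations of who is infectious, which is the source of the tractability claimed above.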

### Projection — Parameter Learning

After finding the means, we need to estimate the parameters, since the system has many noisy observations and the parameters are unknown. First, we write down the probability, which factors as a product over time steps. After writing down the likelihood of X and Y, and then the log likelihood, we can write down the expected log likelihood. Of course, in order to write down the expected log likelihood, we need the probability distribution of the latent states; since we do not know it, we use our previous estimate of that distribution. We then take the partial derivative of the expected log likelihood with respect to the parameters, which amounts to taking the partial derivative of $\log P$ with respect to each parameter.

# CSE675 References

### 1. Simulation of Markov Processes

#### 1.1 What is simulation modeling and why should we care? Review of probability theory.

• Stochastic modelling for systems biology [1], Chapter 1: What is simulation modeling? Why simulation modeling in systems biology?
• The Big Book of Simulation Modeling: Multimethod Modeling with AnyLogic 6 [2], Chapter 1: What is simulation modeling? Why simulation modeling in studying complex social systems?
• Generative Social Science: Studies in Agent-Based Computational Modeling (Princeton Studies in Complexity) [3], Chapter 1, The Generativist Manifesto, we do not understand “emergence” if we cannot construct it using agent-based modeling.
• Interaction process analysis; a method for the study of small groups [4]: Social psychology, while seemingly too complicated to be modeled, has models

#### 1.2 Where are the probability distributions (Bernoulli trial, binomial, multinomial, order statistics, exponential, Poisson, Gaussian and more) from?

• Stochastic Simulation [5]: sampling techniques
• Simulation Modeling and Analysis [6]: sampling techniques
• Probabilities, Statistics and Data Modeling: An interesting e-book on where different probability distributions originated, estimation and statistical testing, published by AI ACCESS (aiaccess@aol.com). Unfortunately this book is no longer available. I will share chapters later.

#### 1.3 What are the random processes we often encounter? How do we sample from them?

• Stochastic modelling for systems biology [1], chapter 6 on relating microscopic molecule-level interactions and macroscopic dynamics, chapter 8 on approximate algorithms.
• Financial Modeling with Markov Jump Processes [7], on simulating Markov jump processes.

### 2. Making Inferences with Markov Processes

#### 2.1 Exact inference with message passing and expectation maximization with applications to hidden Markov models and Kalman filter

• A new approach to linear filtering and prediction problems [8], this is where Kalman filter was first brought forth by Kalman
• Kalman filter on Wikipedia.
• The extended Kalman filter is a first-order approximation to the state transition kernel when it is non-linear. The unscented Kalman filter is reported to outperform the extended Kalman filter.
• Hidden Markov model: formulation, forward-backward method for latent state inference, EM algorithm for parameter estimation, Viterbi decoding algorithm to find maximum likelihood latent state path.

#### 2.2 Legendre transform, K-L divergence, with applications to Markov processes that defy exact inference

• Graphical Models, Exponential Families, and Variational Inference [9]: the best theoretical treatment of this topic, with loop belief propagation and expectation propagation discussed in chapter 4, mean field method in chapter 5 and parameter learning in chapter 6.
• Factorial Hidden Markov Models [10]: This is  where Ghahramani and Jordan first applied mean field method to make inferences with graphical models.
• Expectation Propagation for Approximate Inference in Dynamic Bayesian Networks [11]: duality between minimizing Bethe variation and minimizing negative log likelihood.
• Expectation Propagation for approximate Bayesian inference [12]: This is where EP was first brought forth by Minka.

#### 2.3 Markov chain Monte Carlo, with applications to Markov processes that defy exact inference

• Stochastic modelling for systems biology [1]: chapter 9 on Gibbs sampling and Metropolis-Hastings sampling, chapter 10 on sampling based inference of stochastic kinetic models.

### 3. Complex Systems identified by Markov Processes

#### 3.1 Evolution of social networks: random graph model, exponential random graph model, network formation game

• Exponential Random Graph Models for Social Networks [13]: ERGM simulation, inference, parameter learning, applications; Temporal ERGM.
• Game theory models of network formation: A strategic model of social and economic networks [14];The evolution of social and economic networks [15].
• statnet, a suite of software packages for network analysis that implement recent advances in the statistical modeling of networks.
• Physicists’ models of network evolution: On Random Graphs [16], On the Evolution of Random Graphs [17], Collective dynamics of ‘small-world’ networks [18], Small worlds: the dynamics of networks between order and randomness [19], Emergence of scaling in random networks [20].
• Real world network formation: Reality Commons data set and a network tracked with personal mobile phones [21].
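The physicists’ models above can be simulated in a few lines. The following is a rough sketch of preferential attachment in the spirit of Barabási and Albert [20]; the seed triangle and the choice of m = 2 edges per new node are illustrative, not from the paper.

```python
import random

# Preferential attachment in the spirit of Barabasi-Albert [20]: each new
# node attaches m edges to existing nodes with probability proportional to
# their degree, sampled via a list where node i appears degree(i) times.

def preferential_attachment(n, m=2, seed=0):
    rng = random.Random(seed)
    edges = [(0, 1), (1, 2), (2, 0)]            # small seed triangle
    endpoints = [v for e in edges for v in e]   # degree-proportional pool
    for new in range(3, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # degree-proportional choice
        for t in targets:
            edges.append((new, t))
            endpoints.extend([new, t])
    return edges

edges = preferential_attachment(200)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

The rich-get-richer dynamic produces a few highly connected hubs, the qualitative signature of the scale-free degree distributions reported in [20].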

#### 3.2 Diffusion in social networks: opinion dynamics, culture dynamics & language dynamics

• Statistical Physics of Social Dynamics [22]: Opinion dynamics, formation of languages and cultures, etc.
• Interacting Particle Systems [23].
• Agent-Based Models (Quantitative Applications in Social Sciences) [24]
• NetLogo
• Diffusion in real world: The spread of obesity in a large social network over 32 years [25] …

#### 3.3 Urban dynamics: formation of cities, road transportation dynamics

• Cities and complexity: understanding cities with cellular automata, agent-based models, and fractals [26]
• Urban Dynamics [27]: a book on the interactions between different industrial sectors using the system dynamics approach.
• TRANSIMS: Transportation analysis and simulation system [28]
• MATSIM-T: Aims, approach and implementation [29]

### References

[1] D. J. Wilkinson, Stochastic modelling for systems biology, Boca Raton, FL: Taylor & Francis, 2006.
[2] A. Borshchev, The big book of simulation modeling: multimethod modeling with AnyLogic 6, AnyLogic North America, 2013.
[3] J. M. M. Epstein, Generative social science: studies in agent-based computational modeling (princeton studies in complexity), Princeton University Press, 2007.
[4] R. F. Bales, “Interaction process analysis; a method for the study of small groups.,” , 1950.
[5] B. D. Ripley, Stochastic simulation, John Wiley & Sons, 2009, vol. 316.
[6] D. W. Kelton and A. M. Law, Simulation modeling and analysis, McGraw Hill Boston, 2000.
[7] P. Tankov, Financial modelling with jump processes, Boca Raton, Fla.: Chapman & Hall/CRC, 2004.
[8] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of basic engineering, vol. 82, iss. 1, pp. 35-45, 1960.
[9] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and trends in machine learning, vol. 1, iss. 1-2, pp. 1-305, 2008.
[10] Z. Ghahramani and M. I. Jordan, “Factorial hidden Markov models,” Machine learning, vol. 29, iss. 2-3, pp. 245-273, 1997.
[11] T. Heskes and O. Zoeter, “Expectation propagation for approximate inference in dynamic Bayesian networks,” in UAI, 2002, pp. 216-223.
[12] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in UAI, 2001, pp. 362-369.
[13] G. Robins, “Exponential random graph models for social networks,” in The Sage Handbook of Social Network Analysis, Sage, 2011.
[14] M. O. Jackson and A. Wolinsky, “A strategic model of social and economic networks,” Journal of economic theory, vol. 71, iss. 1, pp. 44-74, 1996.
[15] M. O. Jackson and A. Watts, “The evolution of social and economic networks,” Journal of economic theory, vol. 106, iss. 2, pp. 265-295, 2002.
[16] P. Erdős and A. Rényi, “On random graphs I,” Publ. math. debrecen, vol. 6, pp. 290-297, 1959.
[17] P. Erdős and A. Rényi, “On the evolution of random graphs,” Publ. math. inst. hungar. acad. sci., vol. 5, pp. 17-61, 1960.
[18] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, iss. 6684, pp. 440-442, 1998.
[19] D. J. Watts, Small worlds: the dynamics of networks between order and randomness, Princeton university press, 1999.
[20] A. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, iss. 5439, pp. 509-512, 1999.
[21] W. Dong, B. Lepri, and A. Pentland, “Modeling the co-evolution of behaviors and social relationships using mobile phone data,” in Mum, 2011, pp. 134-143.
[22] C. Castellano, S. Fortunato, and V. Loreto, “Statistical physics of social dynamics,” Reviews of modern physics, vol. 81, iss. 2, pp. 591-646, 2009.
[23] T. Liggett, Interacting particle systems, Springer Science & Business Media, 2012, vol. 276.
[24] N. Gilbert, Agent-based models (quantitative applications in social sciences), Sage Publications, Inc., 2007.
[25] N. A. Christakis and J. H. Fowler, “The spread of obesity in a large social network over 32 years,” New england journal of medicine, vol. 357, iss. 4, pp. 370-379, 2007.
[26] M. Batty, Cities and complexity: understanding cities with cellular automata, agent-based models, and fractals, Mit Press, 2007.
[27] J. W. Forrester, Urban dynamics, Massachusetts Institute of Technology, 1970, vol. 11.
[28] L. Smith, R. Beckman, and K. Baggerly, “TRANSIMS: Transportation analysis and simulation system,” Los Alamos National Lab., NM (United States), 1995.
[29] MATSim development team (ed.), “MATSIM-T: Aims, approach and implementation,” {IVT, ETH Zürich, Zürich} 2007.

# Problem Set 1 Solution

### Solution to Problem 1

1.1 There is ${6 \choose 0}=1$ event with 0 outcomes (the empty event), ${6 \choose 1}=6$ events with 1 outcome (also called atomic events), ${6 \choose 2}=15$ events with 2 outcomes, …, ${6 \choose 6}=1$ event with all 6 outcomes. The total number of events is $\sum_{i=0}^{6}{6 \choose i}=2^{6}=64$. An event with 2 outcomes is, for instance, “the outcome is either 1 or 2”.

1.2 The events generated by $X$ are $\left\{ \emptyset,\{\omega:X(\omega)=0\},\{\omega:X(\omega)=1\},\Omega\right\}$.
In other words, they are $\left\{ \emptyset,\{1,2\},\{3,4,5,6\},\{1,2,3,4,5,6\}\right\}$.

The events generated by $Y$ are $\left\{ \emptyset,\{\omega:Y(\omega)=0\},\{\omega:Y(\omega)=1\},\Omega\right\}$. In other words, they are $\left\{ \emptyset,\{2,3,5\},\{1,4,6\},\{1,2,3,4,5,6\}\right\}$.

The events generated by $X$ and $Y$ jointly are the events “generated” by the following 4 disjoint events: $A=\{\omega:X(\omega)=0\&Y(\omega)=0\}=\{2\}$, $B=\{\omega:X(\omega)=0\&Y(\omega)=1\}=\{1\}$, $C=\{\omega:X(\omega)=1\&Y(\omega)=0\}=\{3,5\}$ and $D=\{\omega:X(\omega)=1\&Y(\omega)=1\}=\{4,6\}$. There are $2^{4}=16$ of them: $\emptyset$, $A$, $B$, $C$, $D$, $A\cup B$, $A\cup C$, $A\cup D$, $B\cup C$, $B\cup D$, $C\cup D$, $A\cup B\cup C$, $A\cup B\cup D$, $A\cup C\cup D$, $B\cup C\cup D$, $\Omega$.

Since every outcome has an equal probability of $\frac{1}{6}$, the probability of an event can be computed from the number of outcomes in the event. For example, $P(A)=\frac{1}{6}$, $P(B)=\frac{1}{6}$, $P(C)=\frac{2}{6}$ and $P(D)=\frac{2}{6}$.

The mean of $X$ is $\mbox{E}X=0\cdot P(X=0)+1\cdot P(X=1)=\frac{4}{6}$. The variance of $X$ is $(0-\mbox{E}X)^{2}\cdot P(X=0)+(1-\mbox{E}X)^{2}\cdot P(X=1)=P(X=0)P(X=1)=\frac{8}{36}$. The mean of $Y$ is $\mbox{E}Y=0\cdot P(Y=0)+1\cdot P(Y=1)=\frac{3}{6}$. The variance of $Y$ is $\frac{9}{36}$.

The events defined by $XY$ are “generated” by $\{\omega:(XY)(\omega)=0\}$ and $\{\omega:(XY)(\omega)=1\}=\{\omega:X(\omega)=1\}\cap\{\omega:Y(\omega)=1\}=\{4,6\}$, with probability $P(\{\omega:(XY)(\omega)=1\})=P(\{4,6\})=\frac{2}{6}$. The mean of $XY$ is hence $\frac{2}{6}$ and the variance is hence $\frac{8}{36}$.

The random variables $X$ and $Y$ are dependent, because $P(X,Y)\ne P(X)P(Y)$, where $(X,Y)\in\{(0,0),(0,1),(1,0),(1,1)\}$.

1.3 Conditional expectations are random variables, and we can work out the events defined by the conditional expectations as we did previously. $\left(\mbox{E}(I|{\cal F}_{X})\right)(\omega)$ is constant on each atom of ${\cal F}_{X}$ and equals the average of $I$ over the atom containing $\omega$: $\left(\mbox{E}(I|{\cal F}_{X})\right)(\omega)=\sum_{\omega'}I(\omega')P(\omega'\,|\,X(\omega')=X(\omega))$.

\begin{eqnarray*}
& & \begin{array}{c|cccccc}
\omega & 1 & 2 & 3 & 4 & 5 & 6\\
\hline I(\omega) & 1 & 2 & 3 & 4 & 5 & 6\\
\left(\mbox{E}(I|{\cal F}_{X})\right)(\omega) & \frac{3}{2} & \frac{3}{2} & \frac{9}{2} & \frac{9}{2} & \frac{9}{2} & \frac{9}{2}\\
\left(\mbox{E}(I|{\cal F}_{Y})\right)(\omega) & \frac{11}{3} & \frac{10}{3} & \frac{10}{3} & \frac{11}{3} & \frac{10}{3} & \frac{11}{3}\\
\left(\mbox{E}(I|{\cal F}_{X,Y})\right)(\omega) & 1 & 2 & 4 & 5 & 4 & 5
\end{array}
\end{eqnarray*}

### Solution to Problem 2

2.1 Let us count the number of successes in a unit time if we conduct a Bernoulli trial every $\Delta$ time units with success probability $\lambda\Delta$. The count follows a binomial distribution $B(\frac{1}{\Delta},\lambda\Delta)$, and the probability of $k$ successes is $p(k;\frac{1}{\Delta},\lambda\Delta)={\frac{1}{\Delta} \choose k}\left(\lambda\Delta\right)^{k}\left(1-\lambda\Delta\right)^{\frac{1}{\Delta}-k}$. Taking the limit
\begin{eqnarray*}
\lim_{\Delta\to0}p(k;\frac{1}{\Delta},\lambda\Delta) & = & \lim_{\Delta\to0}{\frac{1}{\Delta} \choose k}\left(\lambda\Delta\right)^{k}\left(1-\lambda\Delta\right)^{\frac{1}{\Delta}-k}\\
& = & \lim_{\Delta\to0}\frac{\frac{1}{\Delta}(\frac{1}{\Delta}-1)\cdots(\frac{1}{\Delta}-k+1)\cdot\Delta^{k}}{k!}\lambda^{k}\exp(-\lambda)\\
& = & \frac{1}{k!}\lambda^{k}\exp(-\lambda).
\end{eqnarray*}
We get a Poisson distribution with rate $\lambda$.
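This limit can be checked numerically by comparing the binomial pmf $B(\frac{1}{\Delta},\lambda\Delta)$ with the Poisson pmf as $\Delta=\frac{1}{n}$ shrinks; the rate $\lambda=3$ below is an arbitrary illustrative choice.

```python
from math import comb, exp, factorial

# Compare the binomial pmf B(n, lam/n) with the Poisson(lam) pmf as n grows,
# i.e. as the Bernoulli-trial interval Delta = 1/n shrinks.

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k / factorial(k) * exp(-lam)

lam = 3.0
gaps = {}
for n in (10, 100, 10000):
    gaps[n] = max(abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam))
                  for k in range(10))
```

The largest pointwise gap between the two pmfs shrinks roughly in proportion to $\Delta$, consistent with the limit derived above.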

2.2 Let $X$ have an exponential distribution with rate $\lambda$. In other words, $X$ has probability density function $f(x)=\lambda\exp(-\lambda x)$ and cumulative probability function $P(\{X\le x\})=1-\exp(-\lambda x)$. It follows that
\begin{eqnarray*}
P(\{X\ge s+t\}) & = & \exp(-\lambda(s+t))\\
& = & \exp(-\lambda s)\cdot\exp(-\lambda t)\\
& = & P(\{X\ge s\})\cdot P(\{X\ge t\}).
\end{eqnarray*}
Let $X$ have geometric distribution, with cumulative probability function $P(\{X\le k\})=1-(1-p)^{k}$. It follows that
\begin{eqnarray*}
P(\{X\ge s+t\}) & = & (1-p)^{s+t}\\
& = & (1-p)^{s}(1-p)^{t}\\
& = & P(\{X\ge s\})\cdot P(\{X\ge t\}).
\end{eqnarray*}

In fact, the memoryless property can also be understood in terms of how we sample the time to the next success: the sampling process never refers to the history.
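The memoryless property can be checked by simulation: among samples with $X\ge s$, the fraction with $X\ge s+t$ should match the unconditional probability $P(\{X\ge t\})=\exp(-\lambda t)$. The values of $\lambda$, $s$ and $t$ below are arbitrary.

```python
import math, random

# Numerical check of memorylessness for the exponential distribution:
# P(X >= s + t | X >= s) should equal P(X >= t) = exp(-lambda * t).

rng = random.Random(0)
lam, s, t, n = 1.5, 0.4, 0.7, 200000
samples = [rng.expovariate(lam) for _ in range(n)]

survivors = [x for x in samples if x >= s]
cond = sum(x >= s + t for x in survivors) / len(survivors)   # P(X>=s+t | X>=s)
uncond = sum(x >= t for x in samples) / n                    # P(X>=t)
```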

2.3 We first construct the Lagrange function $L=-\sum_{i=1}^{\infty}p_{i}\log p_{i}+\lambda_{0}(\sum_{i=1}^{\infty}p_{i}-1)+\lambda_{1}(\sum_{i=1}^{\infty}i\cdot p_{i}-\mu)$, which is differentiable in the parameters $p_{i}$. Taking the partial derivatives of the Lagrange function with respect to the parameters and setting them to zero, we get
\begin{eqnarray*}
& & \frac{\partial L}{\partial p_{i}}=-1-\log p_{i}+\lambda_{0}+i\lambda_{1}\stackrel{\mbox{set}}{=}0\\
& \Rightarrow & p_{i}=\exp\left(\lambda_{0}-1+i\lambda_{1}\right)=\exp(\lambda_{0}-1)\exp(\lambda_{1}\cdot i),\mbox{ with }\lambda_{1}<0\mbox{ for convergence}.\\
& & \sum_{i=1}^{\infty}p_{i}=\exp(\lambda_{0}-1)\sum_{i=1}^{\infty}\exp(\lambda_{1}\cdot i)\stackrel{\mbox{set}}{=}1\\
& \Rightarrow & \exp(\lambda_{0}-1)\frac{\exp(\lambda_{1})}{1-\exp(\lambda_{1})}=1\Rightarrow\exp(\lambda_{0}-1)=\exp(-\lambda_{1})-1\\
& & \sum_{i=1}^{\infty}i\cdot p_{i}=\left(\exp(-\lambda_{1})-1\right)\sum_{i=1}^{\infty}i\cdot\exp(\lambda_{1}\cdot i)\\
& & =\left(\exp(-\lambda_{1})-1\right)\frac{\partial}{\partial\lambda_{1}}\sum_{i=1}^{\infty}\exp(\lambda_{1}\cdot i)\\
& & =\left(\exp(-\lambda_{1})-1\right)\frac{\partial}{\partial\lambda_{1}}\frac{\exp(\lambda_{1})}{1-\exp(\lambda_{1})}=\frac{1}{1-\exp(\lambda_{1})}\stackrel{\mbox{set}}{=}\mu.
\end{eqnarray*}
Hence $\exp(\lambda_{1})=1-\frac{1}{\mu}$ and $p_{i}=(1-\frac{1}{\mu})^{i-1}\frac{1}{\mu}$, a geometric distribution with mean $\mu$.
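We can check numerically that this geometric distribution satisfies the constraints and has higher entropy than another distribution with the same mean; the competitor below, which splits its mass between $1$ and $2\mu-1$, is an arbitrary illustrative choice.

```python
from math import log

# Check that the geometric pmf p_i = (1 - 1/mu)^(i-1) / mu has mean mu and
# higher entropy than a competitor distribution with the same mean.

mu = 3.0
N = 500                                     # truncation; tail mass negligible
p = [(1 - 1 / mu) ** (i - 1) / mu for i in range(1, N + 1)]

total = sum(p)
mean = sum(i * p[i - 1] for i in range(1, N + 1))
entropy = -sum(q * log(q) for q in p)

# competitor with the same mean: half the mass on 1, half on 2*mu - 1
competitor = {1: 0.5, 2 * mu - 1: 0.5}
competitor_mean = sum(k * w for k, w in competitor.items())
competitor_entropy = -sum(w * log(w) for w in competitor.values())
```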

2.4 We first express the cumulative probability function of a Gamma distribution with integer shape $k$ in terms of a Poisson distribution, $P(X\le x)=1-\sum_{i=0}^{k-1}\frac{(\lambda x)^{i}}{i!}\exp(-\lambda x)$, then take the derivative with respect to “time” to get the probability density function: $\frac{d}{dx}P(X\le x)=\frac{\lambda}{(k-1)!}(\lambda x)^{k-1}\exp(-\lambda x)$, since the sum telescopes.
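The Gamma-Poisson identity above can be sanity-checked numerically by integrating the Gamma density with the trapezoid rule; the parameters $k=4$, $\lambda=2$, $x=3$ are arbitrary.

```python
from math import exp, factorial

# Verify P(Gamma(k, lam) <= x) = 1 - sum_{i<k} (lam x)^i / i! * exp(-lam x)
# by integrating the density lam (lam t)^(k-1) exp(-lam t) / (k-1)!.

def gamma_pdf(t, k, lam):
    return lam * (lam * t) ** (k - 1) * exp(-lam * t) / factorial(k - 1)

def gamma_cdf_via_poisson(x, k, lam):
    return 1 - sum((lam * x) ** i / factorial(i) * exp(-lam * x)
                   for i in range(k))

k, lam, x, n = 4, 2.0, 3.0, 20000
h = x / n
grid = [gamma_pdf(i * h, k, lam) for i in range(n + 1)]
numeric_cdf = h * (sum(grid) - 0.5 * (grid[0] + grid[-1]))   # trapezoid rule
```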

### Solution to Problem 3

3.1 The moment generating function of a random variable $X$ is $\mbox{M}(t)=\mbox{E}\left(\exp(tX)\right)=\int dx\cdot p(x)\exp(t\cdot x)$. We prove the moment generating property by induction. Its $0^{\mbox{th}}$ order derivative at $t=0$ (i.e., its value at 0) is $M(0)=\int dx\cdot p(x)\exp(0\cdot x)=1$. Its $1^{\mbox{st}}$ order derivative is
\begin{eqnarray*}
\frac{\partial}{\partial t}\mbox{M}(t) & = & \frac{\partial}{\partial t}\int dx\cdot p(x)\exp(t\cdot x)\\
& = & \int dx\cdot p(x)\frac{\partial}{\partial t}\exp(t\cdot x)\mbox{ , with mild regularity condition}\\
& = & \int dx\cdot p(x)\cdot x\exp(t\cdot x).
\end{eqnarray*}
Hence $\frac{\partial}{\partial t}\mbox{M}(t)|_{t=0}=\int dx\cdot p(x)\cdot x\exp(0\cdot x)=\int dx\cdot p(x)\cdot x=\mbox{E}(X)$.
Now suppose $\frac{\partial^{n}}{\partial t^{n}}\mbox{M}(t)=\int dx\cdot p(x)\cdot x^{n}\exp(t\cdot x)$,
it follows that $\frac{\partial^{n}}{\partial t^{n}}\mbox{M}(t)|_{t=0}=\mbox{E}(X^{n})$,
and
\begin{eqnarray*}
\frac{\partial^{n+1}}{\partial t^{n+1}}\mbox{M}(t) & = & \frac{\partial}{\partial t}\left(\frac{\partial^{n}}{\partial t^{n}}\mbox{M}(t)\right)\\
& = & \frac{\partial}{\partial t}\int dx\cdot p(x)\cdot x^{n}\exp(t\cdot x)\\
& = & \int dx\cdot p(x)\cdot x^{n}\frac{\partial}{\partial t}\exp(t\cdot x)\mbox{, with mild regularity condition}\\
& = & \int dx\cdot p(x)\cdot x^{n+1}\exp(t\cdot x).
\end{eqnarray*}
Hence $\frac{\partial^{n+1}}{\partial t^{n+1}}\mbox{M}(t)|_{t=0}=\mbox{E}(X^{n+1})$.

Given random variable $X$ and its moment generating function $\mbox{M}_{X}(t)$, the linear transform of the random variable is $\alpha X+\beta$, where $\alpha$ and $\beta$ are real numbers (or vectors). The moment generating function by definition is
\begin{eqnarray*}
\mbox{M}_{\alpha X+\beta}(t) & = & \mbox{E}\left(\exp\left(t\cdot(\alpha x+\beta)\right)\right)\\
& = & \mbox{E}\left(\exp(\alpha t\cdot x)\exp(t\cdot\beta)\right)\\
& = & \mbox{E}\left(\exp(\alpha t\cdot x)\right)\cdot\exp(t\cdot\beta)\mbox{, linearity of expectation}\\
& = & \mbox{M}_{X}\left(\alpha t\right)\cdot\exp(t\cdot\beta).
\end{eqnarray*}

Given independent random variables $X_{1},\cdots,X_{n}$ with moment generating functions $\mbox{M}_{X_{1}}(t),\cdots,\mbox{M}_{X_{n}}(t)$, the moment generating function of the sum is
\begin{eqnarray*}
\mbox{M}_{\sum_{i=1}^{n}X_{i}}(t) & = & \mbox{E}\left(\exp(t\cdot\sum_{i=1}^{n}X_{i})\right)\\
& = & \mbox{E}\left(\prod_{i=1}^{n}\exp(t\cdot X_{i})\right)\\
& = & \prod_{i=1}^{n}\mbox{E}\left(\exp(t\cdot X_{i})\right)\mbox{, functions of independent r.v.s are independent}\\
& = & \prod_{i=1}^{n}\mbox{M}_{X_{i}}\left(t\right).
\end{eqnarray*}
3.2 The moment generating function of linear transform is $\mbox{M}_{\frac{X_{i}-\mu}{\sigma}}(t)=\mbox{M}_{X_{i}}(\frac{t}{\sigma})\cdot\exp(-\frac{\mu}{\sigma}\cdot t)$.
The moment generating function of the sum is $\mbox{M}_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}}(t)=\exp\left(-\frac{\sqrt{n}\mu}{\sigma}\cdot t\right)\prod_{i=1}^{n}M_{X_{i}}(\frac{t}{\sqrt{n}\sigma})$.

3.3
\begin{eqnarray*}
\mbox{M}_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}}(t) & = & \prod_{i=1}^{n}M_{\frac{X_{i}-\mu}{\sigma}}(\frac{t}{\sqrt{n}})\mbox{, m.g.f. of a sum of independent r.v.s}\\
& = & \prod_{i=1}^{n}\left(1+\frac{t}{\sqrt{n}}\cdot0+\frac{t^{2}}{2!\cdot n}\cdot1+\frac{\xi_{n}}{n}\right)\mbox{, Taylor expansion, where }\xi_{n}\to0\\
& = & \left(1+\frac{t^{2}}{2n}+\frac{\xi_{n}}{n}\right)^{n},\\
\lim_{n\to\infty}\mbox{M}_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}}(t) & = & \exp(\frac{t^{2}}{2})\mbox{, since }(1+\frac{a_{n}}{n})^{n}\to e^{a}\mbox{ when }a_{n}\to a.
\end{eqnarray*}

Because the moment generating functions of this sequence of random variables converge to the moment generating function of a standard normal random variable, and a moment generating function uniquely determines a distribution, it follows that the sequence of random variables converges in distribution to a standard normal random variable.
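The convergence can be illustrated by simulation, here with uniform $X_i$; the choices of $n$ and the number of trials are arbitrary.

```python
import random
from math import sqrt

# Simulate Z = (1/sqrt(n)) * sum((X_i - mu) / sigma) for uniform X_i and
# check that it behaves like a standard normal (mean 0, variance 1, and
# about 68% of the mass within one standard deviation).

rng = random.Random(0)
n, trials = 50, 20000
mu, sigma = 0.5, sqrt(1 / 12)               # mean and std of Uniform(0, 1)

zs = []
for _ in range(trials):
    total = sum((rng.random() - mu) / sigma for _ in range(n))
    zs.append(total / sqrt(n))

z_mean = sum(zs) / trials
z_var = sum((z - z_mean) ** 2 for z in zs) / trials
frac_within_1 = sum(abs(z) <= 1 for z in zs) / trials   # should be near 0.68
```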

# Exponential Family, divergence and convex optimization [09/24/2015]

prepared by Hao Fan, Manu Kothegal Shivakumar, and Pritika Mehta.

### Maximum entropy principle

Entropy is a measure of unpredictability of information content. The Shannon entropy $H$ of a discrete random variable $X$ with possible values $\{x_1, \cdots, x_n\}$ and probability mass function $P(X)$ is defined as:

$H(X)=E[I(X)]=E[-\log(P(X))]$

Here $E$ is the expected value operator, and $I$ is the information content of $X$. $I(X)$ is itself a random variable. (Wikipedia, Entropy (information theory), 2015)

In the discrete case, the entropy is a weighted sum of the negative logarithm of the probability mass function, where the weight of each value is the probability of obtaining that value. In other words, the Shannon entropy is the expected value of the negative log probability mass, written explicitly as

$H(X)=\sum_i P(x_i)I(x_i)=-\sum_i P(x_i)\log P(x_i)$

When the distribution is continuous rather than discrete, the sum is replaced with an integral as

$H(X)=\int P(x)I(x)dx=-\int P(x)\log P(x)dx$

where $P(x)$ represents a probability density function.

It can also be described as

$H(X) = -\int_{-\infty}^{\infty}f(x)\log f(x)dx$ if $X \in \mathbb{R}$ and $\frac{dP(x)}{d\mu(x)} = f(x)$

where $\mu$ is Lebesgue measure.

The concept of the maximum entropy principle is that, subject to precisely stated prior data, the probability distribution which best represents the current state of knowledge is the one with the largest entropy.

Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. Of those, the one with maximal information entropy is the proper distribution, according to the Principle of maximum entropy.

One question is why do we maximize the entropy, or why the probability distribution with the maximum entropy is the distribution with the least amount of information? The following example can help explain this.

Consider throwing a die. A die has six faces, and the distribution over the six faces with maximum entropy is the uniform distribution. A distribution with minimum (zero) entropy puts all its mass on a single face, say one or two: zero entropy means that we already know the result of the throw, so there is nothing to guess and no uncertainty. The uniform distribution over all six faces has maximum entropy, meaning that the expected value of the negative logarithm of the probability mass function is maximized. So we can see that by maximizing entropy, we minimize the prior information built into the distribution. Moreover, many physical systems tend to move towards maximal-entropy configurations over time.
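The die example can be checked numerically; the loaded distribution below is an arbitrary illustrative choice.

```python
from math import log

# Shannon entropy H(P) = -sum p log p for distributions over a six-faced die,
# with the convention 0 * log 0 = 0.

def entropy(p):
    return -sum(q * log(q) for q in p if q > 0)

uniform = [1 / 6] * 6                        # maximum uncertainty
loaded = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]   # skewed toward low faces
certain = [1, 0, 0, 0, 0, 0]                 # outcome known in advance
```

The uniform die attains the maximum entropy $\log 6$, the certain die has entropy zero, and every other distribution falls strictly between.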

Maximum entropy is closely related to a family of distributions called the exponential family. According to the theory of the exponential family, the distribution which maximizes entropy subject to constraints on moments lies in the exponential family with sufficient statistics corresponding to those moments.

We have already talked about many distributions, such as the geometric distribution, exponential distribution, normal distribution, and uniform distribution. Their relations with maximal entropy are as below.

• Geometric distribution: maximizes entropy given mean on natural numbers
• Exponential distribution: maximizes entropy given mean on positive real numbers
• Normal distribution: maximizes entropy given mean and standard deviation
• Uniform distribution: maximizes entropy on the support

A generalization of these distributions is the Boltzmann distribution from statistical physics. This distribution is characterized by $n$ invariants, expressed through $n$ statistics, or features. A statistic is something that we can extract from a sample: for example, the sum of the samples, their product, their maximum, or their minimum.

Here each invariant is the expected value of a feature; we use $f_i$ to denote the features, of which there are $n$. Given the average values of the features, we maximize the entropy of the distribution subject to these constraints. The format is as below.

Given $n$ invariants $E(f_i(X)) = a_i$ for $i = 1,2,\cdots,n$:

• p.m.f. $p_k=c\cdot\exp(\sum_{i=1}^n\lambda_i f_i(k))$ for $X \in \{1,2,\cdots\}$
• p.d.f. $p(x)=c\cdot\exp(\sum_{i=1}^n\lambda_i f_i(x))$ for $X\in S\subseteq\mathbb{R}$
• $c$ is the normalizing constant, the reciprocal of the partition function

Suppose the random variable takes only a finite number of values $1, 2, 3, 4, \ldots, K$. Then, as we can show, the probability mass function has an exponential form, whose exponent can be viewed as a dot product between the parameters $\lambda_i$ and the features $f_i$. If the random variable takes values on the real line or in a higher-dimensional Euclidean space, the probability density function has the same exponential form.

The proof is through the method of Lagrange multipliers.

The Lagrange function is

$J(P)=-\sum_{k=1}^K p_k \log p_k+\mu(\sum_{k=1}^K p_k - 1)+\sum_{i=1}^n \lambda_i (\sum_{k=1}^K p_k f_i(k)-a_i)$

Take the derivative with respect to $p_k$ and set it to 0; we have

$-1-\log p_k+\mu+\sum_{i=1}^n\lambda_i f_i(k)=0$

The solution is

$p_k=\frac{\exp(\sum_{i=1}^n\lambda_i f_i(k))}{\exp(1-\mu)}=c\cdot \exp(\sum_{i=1}^n\lambda_i f_i(k))$

Where

$c=\frac{1}{\exp(1-\mu)}$
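As a worked instance of this derivation, we can compute the maximum entropy distribution on a die's faces $k=1,\ldots,6$ with a single feature $f(k)=k$ and the mean constrained to $4.5$ (an arbitrary illustrative value). By the result above the solution has the form $p_k=c\cdot\exp(\lambda k)$, and the multiplier $\lambda$ can be found by bisection, since the mean is monotone in it.

```python
from math import exp

# Maximum-entropy distribution on k = 1..6 subject to E[X] = a, which by the
# Lagrange-multiplier derivation has the form p_k = c * exp(lam * k).
# We solve for lam by bisection on the mean, which is increasing in lam.

def maxent_die(a, lo=-20.0, hi=20.0, iters=200):
    def mean(lam):
        w = [exp(lam * k) for k in range(1, 7)]
        return sum(k * wk for k, wk in zip(range(1, 7), w)) / sum(w)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < a:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [exp(lam * k) for k in range(1, 7)]
    z = sum(w)                               # partition function; c = 1/z
    return [wk / z for wk in w]

p = maxent_die(4.5)
```

Because the constrained mean $4.5$ is above the uniform mean $3.5$, the multiplier is positive and the resulting pmf increases exponentially from face 1 to face 6.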

The next thing that we talk about is the Kullback-Leibler divergence. The Kullback–Leibler divergence (also known as information divergence, information gain, relative entropy, KLIC, or KL divergence) is a non-symmetric measure of the difference between two probability distributions $P$ and $Q$. Specifically, the Kullback–Leibler divergence of $Q$ from $P$, denoted $D_{KL}(P || Q)$, is a measure of the information lost when $Q$ is used to approximate $P$.

For categorical distributions, the KL divergence is as below:

$D_{KL}(P || Q)=\sum_{i=1}^n P(i)\log\frac{P(i)}{Q(i)}$

that is, for each category $i$ we take the logarithm of the ratio of the probability mass of $P$ on category $i$ to the probability mass of $Q$ on category $i$, and we sum these quantities weighted by $P(i)$.

One thing to note is that if $Q(i)=0$ but $P(i)\ne 0$ for some category $i$, the divergence is infinite, which we cannot use; similarly, if $P(i)=0$ and $Q(i)\ne 0$ for some category, the corresponding term contributes 0. For the KL divergence over categorical distributions to make sense, we must require that $Q(i)\ne 0$ whenever $P(i)\ne 0$; in other words, $Q(i)=0$ implies $P(i)=0$. This is the condition that $P$ is absolutely continuous with respect to $Q$.

For the continuous version, suppose that $P$ and $Q$ are supported on the real line; then

$D_{KL}(P || Q)=\int dP\log\frac{dP}{dQ}$

To help us understand this expression, consider probability distributions on the real line, which assign probabilities to intervals (just as distributions on a two-dimensional space assign probabilities to squares). Over any interval we can find the probability of $P$ and the probability of $Q$ and divide the former by the latter; the density $\frac{dP}{dQ}$ is the limit of this ratio as the interval shrinks.

Another thing to note is that the KL divergence is similar to the entropy: when $Q$ is the uniform distribution on its support, the KL divergence of $P$ from $Q$ equals the negative entropy of $P$ plus a constant. So by adding or subtracting that constant from the KL divergence of $P$ from a uniform $Q$, we get the negative entropy.
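A minimal sketch computing the KL divergence between categorical distributions, together with a check of the relation to entropy just described; the distributions are arbitrary illustrative choices.

```python
from math import log

# KL divergence for categorical distributions, with the convention
# 0 * log(0 / q) = 0; requires P absolutely continuous w.r.t. Q.

def kl(p, q):
    assert all(qi > 0 for pi, qi in zip(p, q) if pi > 0), "P not abs. cont. w.r.t. Q"
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * log(pi) for pi in p if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]
q = [0.25, 0.25, 0.25, 0.25]                # uniform on 4 categories
```

With a uniform $Q$ on $K$ categories, $D_{KL}(P||Q)=\log K - H(P)$: the constant is $\log K$, and what remains is the negative entropy.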


### Expectation Maximization from minimizing K-L Divergence

The Expectation Maximization algorithm can be used to solve the maximum likelihood parameter estimation problem for a model with hidden/latent variables. The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters, and iterates two steps:

• Expectation step: the (missing) data are estimated given the observed data and current estimates of model parameters
• Maximization step: The likelihood function is maximized under the assumption that the (missing) data are known

Let $Y$ denote the observable variables and let $X$ denote the latent variables, with $Y = (Y_1,Y_2,\cdots,Y_n)$ and $y=(y_1,y_2,\cdots,y_n)$ the observed values of $Y$. The probability model is $p(y,x|\theta)$. If $x$ were also observed, we could easily find $\theta$ by maximizing $\log p(y,x|\theta)$.

The log-likelihood of observing $y$ is $\log p(y|\theta) = \log \sum_x p(y,x|\theta)$. As $x$ is unknown at the time we estimate $\theta$, we average over $X$ using an auxiliary distribution $q(x|y)$, and the expected log-likelihood becomes $\sum_x q(x|y) \log p(y,x|\theta)$.

If $q$ is chosen carefully, the expected log-likelihood will not be far from the actual log-likelihood and can serve as an effective surrogate. Although this does not promise to yield a value of ${\theta}$ that globally maximizes the likelihood, it does promise to improve ${\theta}$ with each iteration. This is the basic idea behind the Expectation Maximization algorithm.

We will begin the derivation of Expectation Maximization by showing that the auxiliary distribution can be used to provide a lower bound on the log-likelihood.

#### E-Step

$$\begin{eqnarray*} \log p(y|\theta) & = & \log\sum\limits _{x}\frac{q(x)\, p(y,x|\theta)}{q(x)}\\ & \ge & \mbox{E}_{q}\left[\log\frac{p(y,x|\theta)}{q(x)}\right]\mbox{ (Jensen's inequality)}\\ & = & \mbox{E}_{q}\left[\log\frac{p(x|y,\theta)\, p(y|\theta)}{q(x)}\right]\\ & = & \mbox{E}_{q}[\log p(y|\theta)]-\mbox{E}_{q}\left[\log\frac{q(x)}{p(x|y,\theta)}\right]\\ & = & \log p(y|\theta)-\mbox{D}_{\mbox{KL}}(q(x)\|p(x|y,\theta)). \end{eqnarray*}$$

In the above derivation we have shown, using Jensen's inequality (explained later on), that for any distribution $q(x)$ the bound $L(q,\theta)$ is a lower bound on the log-likelihood. Since the term $\log p(y|\theta)$ does not depend on $q(x)$, tightening the bound means minimizing the second term, the KL divergence. By Gibbs' inequality, the KL divergence is non-negative and is zero only when the two distributions are identical, which gives $q(x) = p(x|y,\theta)$.

#### M-Step

The M-step then consists of maximizing this lower bound with respect to the parameters. Returning to our lower bound

$$\begin{eqnarray*} & & \log p(y|\theta)\\ & \ge & \mbox{E}_{q}[\log(p(y,x|\theta)/q(x))]\\ & = & \mbox{E}_{q}[\log p(y,x|\theta)]-\mbox{E}_{q}[\log q(x)]\\ & = & \mbox{E}_{q}[\log p(y,x|\theta)]+H(q(x)). \end{eqnarray*}$$

Here the second entropy term does not depend on $\theta$, so we simply set $\theta =\mbox{argmax}_\theta\, \mbox{E}_q[\log p(y,x|\theta)]$. That is, we maximize the expectation of the complete-data log-likelihood with respect to our auxiliary distribution $q(x)$.

Jensen’s inequality:

The intuitive explanation of Jensen’s inequality would be that it permits a comparison between the expectation of a transformation g of a random variable $X$ and this transformation g of the expectation of this random variable $X$.

In the case of g being linear, the two notions coincide; this is the linearity of expectation.

In the case of g convex, we compare the image of $X$ under g with its image under the linear approximation of g at the expectation. Convex functions lie everywhere above their linear approximations (which are tangent lines, or more generally tangent hyperplanes). Since, as stated, the expectation of this linear function equals its value at the expectation of the variable, it is intuitive that the expectation of the convex image of the variable is greater than the function at the expectation: the curvature biases the image of the variable to be on average greater.

Likewise, in the case of g concave, wherein the function lies below any of its linear approximations, we expect the concave image of the random variable to be smaller than the image of its expectation under the concave function g.

In formulas: For $g$ linear, $\mbox{E}(g(X))=g(\mbox E(X))$. For $g$ convex, $\mbox E(g(X)) \ge g(\mbox E(X))$. For $g$ concave, $\mbox E(g(X)) \le g(\mbox E(X))$.

Example: Assume we play a game: roll a die and you win dollars equal to the square of the number of dots (say $X$) that show up. Then $X$ is a random variable and $f(X) = X^2$ is the payoff. What is the expected payoff? $\mbox E[f(X)] = (1 + 4 + 9 + 16 + 25 + 36)/6 \approx 15.17$. What is the expected outcome of the die? $\mbox E[X] = (1+2+\cdots +6)/6=3.5$. What is the payoff for the expected outcome? $f(\mbox E[X])=12.25$. Clearly, $\mbox E[f(X)] > f(\mbox E[X])$.

Now if we change the game such that the payoff is twice the outcome of the die, so $f(X) = 2X$, then $\mbox E[f(X)] = (1+2+\cdots+6)\times 2/6 = 7$ and $f(\mbox E[X]) = 2\times 3.5 = 7$. Clearly, $\mbox E[f(X)] = f(\mbox E[X])$.

Convexity of the payoff in the first game ($X^2$) is responsible for the increased payoff. The linear payoff in the second game ensures that $\mbox E[f(X)] = f(\mbox E[X])$. Had the payoff been concave, we would have seen $\mbox E[f(X)] < f(\mbox E[X])$. This can be understood mathematically by performing a Taylor expansion of $f$ around the mean of $X$: the second-order term explains the behavior for convex ($f''(X)> 0$), concave ($f''(X)< 0$), and linear ($f''(X)=0$) functions.
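The dice computations above can be reproduced with exact rational arithmetic; a minimal sketch:

```python
from fractions import Fraction

faces = range(1, 7)
# Expectation of f(X) for a fair die (uniform over faces)
E = lambda f: sum(Fraction(f(x), 1) for x in faces) / 6

# Convex payoff f(x) = x^2: E[f(X)] > f(E[X])
Ef_convex = E(lambda x: x * x)       # (1+4+9+16+25+36)/6 = 91/6
fE_convex = E(lambda x: x) ** 2      # 3.5^2 = 49/4
print(float(Ef_convex), float(fE_convex))

# Linear payoff f(x) = 2x: equality holds
Ef_linear = E(lambda x: 2 * x)
fE_linear = 2 * E(lambda x: x)
assert Ef_linear == fE_linear == 7
```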

#### References

• http://www.david-andrzejewski.com/publications/em.pdf
• An introduction to probabilistic graphical models (Michael I Jordan)

### Convex Optimization

The Legendre transform, or convex conjugate, of a given function $f(x)$ is a function of another variable $y$:

$$f^{*}(y) = \sup_{x} (y^{T}x - f(x))$$
where $\sup$ denotes the supremum.

Why do we care about the convex conjugate?
1. The convex conjugate $f^{*}(y)$ is a convex function even if $f(x)$ is not convex.
2. The convex conjugate of the convex conjugate (the biconjugate $f^{**}$) is also a convex function; for a convex function it recovers $f$ itself.

The Legendre transformation $f^{*}$ is an encoding of the function $f$'s supporting hyperplanes.

Consider the family of affine functions $h(x) = \langle a,x\rangle + b$ that lie below a convex function $f$ (this family includes lines that do not touch the function at all). For each such $h$ we have $f(x) \ge h(x)$ for all $x$. In fact $f(x) = \sup_{h} h(x)$, the supremum over this family. Why? At any point $x$, the supremum is attained by the supporting line that touches the function there. So taking $\sup_h h(x)$ at every $x$ recovers the convex function $f(x)$.

How do we prove that a function is a convex function?

If we take any two points on the graph and join them with a chord, and every point of the chord lies on or above the function, the function is convex.
A function is not convex when some chord passes both above and below the function.

The convex (Legendre) conjugate of the negative entropy is the log partition function. As an example, consider the maximum entropy problem: given $E[X] = r$ with $X\in\{1, 2, 3, 4, \dots\}$,
minimize $\sum_{i=1}^{\infty} P_{i}\log P_{i}$ (equivalently, maximize the entropy $-\sum_i P_{i}\log P_{i}$)

Method of Lagrange multipliers:

$L = \sum_{i=1}^{\infty} x_{i}\log x_{i} + \lambda\left(\sum_{i=1}^{\infty}ix_{i} - r\right) + \mu\left(\sum_{i=1}^{\infty}x_{i} - 1\right)$

$\frac{\partial L}{\partial x_{i}} = \log x_{i} + 1 + \lambda i + \mu$

Setting $\frac{\partial L}{\partial x_{i}} = 0$:

$x_{i} = e^{-1 -\lambda i -\mu}$

If we let $y_i = -\lambda i -\mu$,

then $x_{i} = e^{y_i -1}$, where $y_i$ is a function of $\lambda$ and $\mu$.

Lets use the convex conjugate:

$f(x) = \sum_{i=1}^{\infty} x_{i}\log x_{i}$

Writing the convex conjugate:

$f^{*}(y) = \sup_{x}\left(y^{T}x - f(x)\right)$

$f^{*}(y) = \sup_{x}\left(y^{T}x - \sum_{i=1}^{\infty} x_{i}\log x_{i}\right)$

$\frac{\partial}{\partial x_{i}}\left(y^{T}x - \sum_{i=1}^{\infty} x_{i}\log x_{i}\right) = 0$

$y_{i} - (1+\log x_{i}) = 0$

$x_{i} = e^{y_{i} - 1}$

Substituting back, each term becomes $y_{i}x_{i} - x_{i}\log x_{i} = e^{y_{i}-1}\left(y_{i} - (y_{i}-1)\right)$, so

$f^{*}(y) = \sum_{i=1}^{\infty} e^{y_{i}-1}$
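The closed form can be sanity-checked numerically in one dimension, where $f^{*}(y)=\sup_x (yx - x\log x)=e^{y-1}$. A grid-based sketch (the grid range and resolution are arbitrary choices):

```python
import math

def conjugate_numeric(y, grid=None):
    """Numerically evaluate f*(y) = sup_x (y*x - x*log(x)) for f(x) = x log x."""
    if grid is None:
        grid = [i / 1000 for i in range(1, 20001)]  # x in (0, 20]
    return max(y * x - x * math.log(x) for x in grid)

for y in (0.0, 0.5, 1.0, 2.0):
    closed_form = math.exp(y - 1)  # the conjugate derived above
    print(y, conjugate_numeric(y), closed_form)
```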

What if P is not an exponential-family distribution? Then the solution is going to be hard to obtain exactly, but we can use approximations. What we do is first take the convex conjugate of P, and then take the convex conjugate of that conjugate (the biconjugate), which will involve some auxiliary variable.

Let's talk about generalized divergences. Given a convex function f, the Bregman divergence is the distance between f and its linear approximation.

Consider a convex function $\phi$ and define the divergence of two points x and y (x and y could be real numbers, vectors, or functions). The divergence is given by
$$d_{\phi}(y,x) = \phi (y) - \phi (x) - \nabla \phi (x)^{T}(y - x)$$
It measures how well y is approximated by the linearization of $\phi$ at x; that is the meaning of the Bregman divergence.

Interestingly if $\phi$ is negative entropy then the Bregman divergence is KL Divergence.

$\phi (x) = \sum_{i=1}^{n}x_i \log x_i$

With $\nabla\phi(x)_i = 1+\log x_i$,

$d(y,x) = \sum_{i=1}^{n}y_i\log y_i - \sum_{i=1}^{n}x_i\log x_i - \sum_{i=1}^{n}(1+\log x_i)(y_i-x_i)$

$= \sum_{i=1}^{n}y_i\log y_i -\sum_{i=1}^{n}x_i\log x_i + \sum_{i=1}^{n}x_i - \sum_{i=1}^{n}y_i + \sum_{i=1}^{n}x_i\log x_i -\sum_{i=1}^{n}y_i\log x_i$

$= \sum_{i=1}^{n}y_i \log\frac{y_i}{x_i} + \sum_{i=1}^{n}x_i - \sum_{i=1}^{n}y_i$

If x and y are probability vectors, $\sum_{i=1}^{n}x_i = 1$ and $\sum_{i=1}^{n}y_i = 1$, so

$= \sum_{i=1}^{n}y_i \log\frac{y_i}{x_i}$ (KL Divergence!)
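A small numeric sketch checking that the Bregman divergence of negative entropy between two probability vectors equals the KL divergence (the example vectors are made up):

```python
import math

def bregman_negentropy(y, x):
    """d_phi(y, x) = phi(y) - phi(x) - <grad phi(x), y - x>
    for phi(v) = sum_i v_i log v_i, whose gradient is 1 + log v_i."""
    phi = lambda v: sum(vi * math.log(vi) for vi in v)
    grad = [1 + math.log(xi) for xi in x]
    return phi(y) - phi(x) - sum(g * (yi - xi) for g, yi, xi in zip(grad, y, x))

def kl(y, x):
    return sum(yi * math.log(yi / xi) for yi, xi in zip(y, x))

y = [0.2, 0.3, 0.5]   # both sum to one, so the sums-to-one terms
x = [0.4, 0.4, 0.2]   # in the derivation cancel, leaving exactly KL
print(bregman_negentropy(y, x), kl(y, x))
```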

Properties of Bregman divergence:

1. $d_\phi(y,x)\ge 0$

2. It is convex in its first argument y (though not necessarily in x).

3. It is linear in $\phi$, which is clear from the definition.

4. Linear separation: the set $\{x| d_\phi(x,u)=d_\phi(x,v)\}$ is a hyperplane. That is, take two points U and V (U and V could be one-dimensional values, distributions, or something more complicated); the set of all points x whose divergences to U and to V are equal forms a hyperplane. The dimensionality of this hyperplane matches the dimensionality of U and V; it could be infinite-dimensional if we think of U and V as functions.

5. $\mbox{argmin}_u \mathbf E_X d_\phi (X, u) = \mathbf E_X X$. If we seek a single point u minimizing the expected divergence between X and u over the distribution of X, the best choice of u is the expected value of X.

6. Conjugate duality: let $\psi(\theta)=\phi^*(\theta)$ be the conjugate of $\phi(x)$; then $d_\phi(x_1,x_2)=d_\psi(\theta_2,\theta_1)$, where $\theta_i=\nabla\phi(x_i)$. The Bregman divergence defined by $\phi$ equals the Bregman divergence defined by the conjugate function $\psi$, with the arguments swapped.

# Hidden Markov Model

In this lecture, we will review exact inference for hidden Markov models and the Kalman filter; the two are closely related. The hidden Markov model can be described as a graphical model with a chain structure (a special case of a tree).

The definition of HMM is given as: a Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated, according to the associated probability distribution. It is only the outcome, not the state, that is visible to an external observer; the states are therefore “hidden” from the outside, hence the name Hidden Markov Model. Briefly, the HMM is about finding the probability distribution over the hidden states given the observations. The difference between a Markov model and an HMM is that in an HMM the observation is also a probabilistic function of the state.

In a mathematical way, HMM is described as:

$X_t\in \{1, 2, \dots, N\}$, where $t=1, 2, 3, \dots$

$Y_t\in \{1, 2, \dots, M\}$, where $t=1, 2, 3, \dots$

$P(X_{t+1}=j|X_t=i)=p_{i,j}$, with $X_0$ given.

$P(Y_{t}=j|X_t=i)=q_{i,j}$

In the model, the $X$s are hidden, or what we call latent; the $Y$s are observations.

There are three fundamental problems that we can solve using HMMs, all of the form: given $Y_1$, $Y_2$, to $Y_t$, what can we say about $X_1$, $X_2$, to $X_t$? First, we can get the probability distribution of $X_t$ conditional on all the observations; this is the inference problem. Second, what is the maximum likelihood sequence $X_1$, $X_2$, …, $X_t$ that maximizes the joint probability of the Xs and Ys? Third, given a sequence of observations $Y_1$ to $Y_T$, what are the parameters? The parameters are the probabilities of transitioning from one state at time t to another state at time t+1, together with the observation probabilities.

In other words, the problems are:

1. Given the HMM model and the observation sequence, find the probability distribution of the state at a certain time.
2. Given the Hidden Markov Model and the observation sequence, find the optimal state sequence for the underlying Markov process.
3. Given an observation sequence, find the parameters of the Hidden Markov Model that maximizes the probability of the sequence.

# Forward-backward algorithm

The forward-backward algorithm has very important applications to both hidden Markov models (HMMs) and conditional random fields (CRFs). It is a dynamic programming algorithm, and is closely related to the Viterbi algorithm for decoding with HMMs. The forward-backward algorithm is comprised of the forward step and a backward step. We have several statistics:

Forward statistic/probability:

$\alpha_t(i)=P(Y_1, \dots, Y_t, X_t=i)$

$\hat{\alpha}_t(i)=P(X_t=i | Y_1, \dots, Y_t)=\frac{\alpha_t(i)}{\sum_i\alpha_t(i)}$

where $\alpha_t$ is the forward statistic: $\alpha_t(i)$ is the joint probability of the observations up to time t and the latent state $X_t=i$. $\hat{\alpha}_t$ is the conditional probability of the latent state $X_t$ given the observations $Y_1$ to $Y_t$, obtained by normalizing $\alpha_t$. By Bayes' theorem, there is a recursive relationship between $\alpha_t$ and $\alpha_{t+1}$, given by:

$\alpha_{t+1}(i)\propto \sum_j\alpha_t(j)p_{j,i}q_{i,y_{t+1}}$

where $p_{j,i}$ is the probability of transitioning from $X_t=j$ to $X_{t+1}=i$. This term is usually called the transition term. $q_{i,y_{t+1}}$ denotes the conditional probability of the observation $y_{t+1}$ given $X_{t+1}=i$. This term is usually called the observation term. The scaled version can be obtained in a similar way.

$\hat{\alpha}_{t+1}(i)\propto \sum_j\hat{\alpha}_t(j)p_{j,i}q_{i,y_{t+1}}$ recursion

$c_{t+1}=\sum_i\sum_j\hat{\alpha}_t(j)p_{j,i}q_{i,y_{t+1}}$ scaling factor

Then we can find the backward statistic by the same kind of iteration.

Backward statistic:

$\beta_t(i)=P(y_{t+1}, \dots, y_T | X_t=i)=\sum_jp_{i,j}\cdot \beta_{t+1}(j)\cdot q_{j,y_{t+1}}$

$\hat{\beta}_t(i)=\frac{\beta_t(i)}{\prod _{s=t+1}^T c_s}$

$\hat{\beta}_t(i)=\frac{1}{c_{t+1}}\sum_j p_{i,j}\cdot \hat{\beta}_{t+1}(j) \cdot q_{j,y_{t+1}}$

From the forward statistic and the backward statistic, we can find out the one-slice statistic, which is defined as gamma, and which is the probability of a latent state X at time t, conditional on all observations from $y_1$, to $y_T$.

One-slice statistic:

$\gamma_t(i)=P(X_t=i | y_1, …, y_T)=\hat{\alpha}_t(i)\cdot\hat{\beta}_t(i)$

We can also find the two-slice statistic, which is the probability distribution of $X_t$, $X_{t+1}$, conditional on $Y_1$ to $Y_T$.

Two-slice statistic:

$\xi_t(i,j)=P(X_t=i, X_{t+1}=j | y_1, \dots, y_T)=\frac{1}{c_{t+1}}\,\hat{\alpha}_t(i)\cdot p_{i,j} \cdot q_{j,y_{t+1}} \cdot \hat{\beta}_{t+1}(j)$

Alternatively, we can think about the forward-backward algorithm in another way: in the forward step we estimate $\hat{\alpha}_t$ for t from 1 to T, the conditional probability of the latent variable given observations up to time t. Then in the backward step, we estimate $\gamma_t$ for t from T down to 1. We know $\gamma_T$, the probability of $X_T$ conditional on $Y_1$ to $Y_T$: it is just $\hat{\alpha}_T(X_T)$, which is already known. We can then get $\gamma_t$ from $\gamma_{t+1}$ by recursion. In this view, $\hat{\beta}_t$ can be treated as the ratio between $\gamma_t$ and $\hat{\alpha}_t$.
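As a concrete sketch of the scaled recursions, here is a two-state example (the transition/emission numbers and the initial distribution `pi0` are invented for illustration; the notes instead condition on a given $X_0$):

```python
# Two-state HMM: p[i][j] = P(X_{t+1}=j | X_t=i), q[i][k] = P(Y_t=k | X_t=i)
p   = [[0.7, 0.3], [0.4, 0.6]]
q   = [[0.9, 0.1], [0.2, 0.8]]
pi0 = [0.5, 0.5]
N   = 2

def forward_backward(ys):
    T = len(ys)
    # Forward pass: alpha_hat[t][i] = P(X_t=i | y_1..y_t); c[t] is the scaling factor
    alpha_hat, c = [], []
    for t, y in enumerate(ys):
        if t == 0:
            a = [pi0[i] * q[i][y] for i in range(N)]
        else:
            a = [sum(alpha_hat[-1][j] * p[j][i] for j in range(N)) * q[i][y]
                 for i in range(N)]
        c.append(sum(a))
        alpha_hat.append([ai / c[-1] for ai in a])
    # Backward pass: scaled beta_hat, then the one-slice statistic gamma
    beta_hat = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta_hat[t] = [sum(p[i][j] * q[j][ys[t + 1]] * beta_hat[t + 1][j]
                           for j in range(N)) / c[t + 1] for i in range(N)]
    gamma = [[alpha_hat[t][i] * beta_hat[t][i] for i in range(N)] for t in range(T)]
    return alpha_hat, beta_hat, gamma

alpha_hat, beta_hat, gamma = forward_backward([0, 0, 1, 0, 1])
for row in gamma:
    print([round(g, 3) for g in row])  # each row is a distribution over states
```

Note that $\gamma_T = \hat{\alpha}_T$, exactly as the paragraph above observes, since $\hat{\beta}_T \equiv 1$.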

# Viterbi algorithm

The idea of Viterbi algorithm is actually pretty easy, although the notation is convoluted. It is a dynamical programming algorithm. The idea is that given $Y_1$ to $Y_T$, we are going to find out the sequence of $X_1$ to $X_T$ that maximizes the probability of $X_1, …, X_T$ and $Y_1, …, Y_T$.

Forward step:

$V_{1,k}=\pi_k \cdot P(Y_1=y_1 | X_1=k)$, where $\pi_k$ is the initial state distribution (if the initial state is known to be $k_0$, then $\pi_{k_0}=1$ and 0 otherwise)

$V_{t,k}=P(Y_t=y_t | X_t=k) \cdot \max_j(V_{t-1,j}\cdot p_{j,k})$, the maximum probability over all sequences $X_1, \dots, X_t$ with $X_t=k$

$U_{t,k}=\arg\max_j(V_{t-1,j}\cdot p_{j,k})$, the state at time $t-1$ in the above maximizing sequence $X_1, \dots, X_{t-1}=U_{t,k}, X_t=k$

The trick here is recursion. Each time we calculate $V_{t,k}$, the probability of the most probable state sequence responsible for the first $t$ observations that has $k$ as its final state. For example at time t, we have $V_{t,k}$ for each $k\in\{1,2,\dots , n\}$. Then at time t+1, $V_{t+1,k}$ can be obtained by checking the probability of all the sequences which end with state k. The probability is calculated as follows: if the state at time t is j, then the best sequence through j has probability $V_{t,j}$, so the best sequence with state j at time t and state k at time t+1 has probability $V_{t,j}p_{j,k}$ (times the observation term), where $p_{j,k}$ is the transition probability from state j to state k. That is the recursion. After we reach time T, we just go backwards, asking at each step: what was the previous state of the maximizing sequence?

Backward step:

$x_T=arg~max_k V_{T,k}$

$x_t=U_{t+1,x_{t+1}}$

By this kind of forward-backward procedure, we can find the state sequence with the maximum joint probability given the observations. The notation is pretty convoluted, but the idea is pretty simple.
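A compact sketch of the forward and backward steps in log space (the two-state transition/emission numbers and the initial distribution `pi0` are invented for illustration):

```python
import math

def viterbi(ys, p, q, pi0):
    """Most probable state path argmax_x P(x_1..x_T, y_1..y_T), in log space."""
    N, T = len(p), len(ys)
    V = [[math.log(pi0[k]) + math.log(q[k][ys[0]]) for k in range(N)]]
    U = []  # back-pointers U[t][k]: best previous state for state k at time t
    for t in range(1, T):
        row, back = [], []
        for k in range(N):
            j = max(range(N), key=lambda j: V[-1][j] + math.log(p[j][k]))
            row.append(V[-1][j] + math.log(p[j][k]) + math.log(q[k][ys[t]]))
            back.append(j)
        V.append(row)
        U.append(back)
    # Backward step: trace the argmax path from the best final state
    x = [max(range(N), key=lambda k: V[-1][k])]
    for back in reversed(U):
        x.append(back[x[-1]])
    return list(reversed(x))

p   = [[0.7, 0.3], [0.4, 0.6]]
q   = [[0.9, 0.1], [0.2, 0.8]]
pi0 = [0.5, 0.5]
print(viterbi([0, 0, 1, 1, 0], p, q, pi0))
```

With these numbers, state 0 mostly emits symbol 0 and state 1 mostly emits symbol 1, so the decoded path tracks the observations.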

# Maximum likelihood estimator given latent state sequence

• An HMM assumes a sequence of random variables to be conditionally independent given a sequence of latent state variables which forms a Markov chain.
• We can start by maximizing the likelihood equation and will conclude that the Expectation-Maximization algorithm is the best approach for MLE of hidden Markov models.
• The situation is that we know only the observations and we do not know the latent states. In this case we resort to the EM algorithm: instead of maximizing the likelihood, we maximize the expected likelihood.

## Maximum likelihood approach

Let’s say we have observed some data

$Y = (y_1, y_2,\dots,y_n)^T$
and we want to fit a likelihood (or posterior) model by maximizing the log-likelihood (or log-posterior) $l(\theta;Y) = \log p(Y|\theta)$.

To formalize this problem, we will follow the standard statistical practice of introducing a family of candidate models for consideration. To this end, let $(\Theta, H)$ be a measurable space, called the parameter space. We consider $\theta \in \Theta$ and the law of the hidden Markov model $(X_t, Y_t)_{t \ge 0}$ defined by $P^\theta$. Our goal is now to select, on the basis of an observed sequence $Y = (y_1, y_2,\dots,y_n)^T$, a parameter estimate $\hat\theta \in \Theta$ such that the observation statistics are well described by the law $P^{\hat\theta}$.

What makes for a good parameter estimate $\hat{\theta}$? Note that the estimator depends, by definition, on the observed training data: $\hat{\theta} = \hat{\theta}_n(y_1, y_2,\dots,y_n)$. We would like to guarantee that the estimate $\hat{\theta}_n$ is close to the ‘true’ value of $\theta$ for large n, regardless of what the ‘true’ parameter happens to be. To be precise, we would like to show that $\hat{\theta}_n \to \theta$ in $P^\theta$-probability for every $\theta \in \Theta$.

Suppose that we don’t know the explicit form of $p(y|\theta)$, instead we know there are some unobserved (hidden) variable x, and we can write down $p(y|\theta)$ as an integration of the joint probability of y and x, so the process steps in this analysis are as follows:

We have defined Hidden Markov model: $(X_t, Y_t)$ for $t \ge 0$

Also, from the previous section: $P(X_1,\dots,X_T,Y_1,\dots,Y_T) = \prod_t P(X_t|X_{t-1})P(Y_t|X_t)$

Maximize the complete-data log-likelihood (in this section the state sequence $X$ is assumed observed):

$l(\theta; X, Y) = \log \prod\limits_{t} P(X_t|X_{t-1})\,P(Y_t|X_t)$

$=\sum\limits_{t} \log P(X_t|X_{t-1}) + \sum\limits_{t} \log P(Y_t|X_t)$

$= \sum\limits_{i} \sum\limits_{j} \left(\log p_{i,j} \sum\limits_{t} \mathbb{1}(X_t = i, X_{t+1} = j)\right) + \sum\limits_{i} \sum\limits_{j} \left(\log q_{i,j} \sum\limits_{t} \mathbb{1}(X_t = i, Y_{t} = j)\right)$

Subject to $\sum\limits_{j}p_{i,j} = 1$ and $\sum\limits_{j}q_{i,j} = 1, \forall i$, the maximizing parameters are the empirical frequencies:

$p_{i,j} = \frac{\sum\limits_{t}\mathbb{1}(X_t = i, X_{t+1} = j)}{\sum\limits_{t}\mathbb{1}(X_t = i)}$

$q_{i,j} = \frac{\sum\limits_{t}\mathbb{1}(X_t = i, Y_{t} = j)}{\sum\limits_{t}\mathbb{1}(X_t = i)}$
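When the state path is observed, these count-ratio formulas are directly computable. A small sketch (the toy sequences are invented; note that the denominator for $p_{i,j}$ counts the visits to state $i$ that are followed by another step):

```python
from collections import Counter

def mle_from_complete_data(xs, ys, N, M):
    """MLE of transition p_ij and emission q_ij when the state path xs is
    observed: each estimate is the corresponding count ratio."""
    trans = Counter(zip(xs, xs[1:]))
    emit  = Counter(zip(xs, ys))
    visits_trans = Counter(xs[:-1])   # denominators for p: times i is left
    visits_emit  = Counter(xs)        # denominators for q: times i is visited
    p = [[trans[(i, j)] / visits_trans[i] if visits_trans[i] else 0.0
          for j in range(N)] for i in range(N)]
    q = [[emit[(i, j)] / visits_emit[i] if visits_emit[i] else 0.0
          for j in range(M)] for i in range(N)]
    return p, q

xs = [0, 0, 1, 1, 0, 1]
ys = [0, 0, 1, 1, 0, 1]
p, q = mle_from_complete_data(xs, ys, N=2, M=2)
print(p)  # rows sum to 1 wherever state i occurs
print(q)
```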

Directly maximizing $l(\theta; Y)$ of this form is difficult when the states are unobserved. Instead of enumerating all possible $x$ and maximizing their sum, we use an iterative, greedy search technique called Expectation-Maximization to maximize the log-likelihood. The algorithm will converge to a local maximum of the likelihood function.

## Expectation-Maximization : Introduction

The EM algorithm is applied for optimization of the likelihood functions in cases where the model is specified with probabilities in terms of an observed and an unobserved (latent) component. HMMs are models of this form, hence HMMs form a class of models for which the EM algorithm can be useful.

In general, if the model consists of two components $(X,Y)$, which we assume take values in a finite space for simplicity, and if the probabilistic model specification consists of the joint point probabilities $P^\theta(X,Y)$, parametrized by $\theta$, then the likelihood when observing only $Y=y$ is:

$l(\theta; Y) = \log \sum\limits_{x} p(y, x | \theta)$

This is not a simple sum. For HMMs the sum is over all possible paths through the hidden states, which quickly becomes a formidable number as the length of the observed sequence grows. Fortunately there are algorithms for HMMs (forward-backward) for fast computation of the likelihood, and the likelihood could then, in principle, be plugged into any general-purpose optimization algorithm for maximum-likelihood estimation of $\theta$. The alternative is the EM algorithm. This is an algorithm that iteratively alternates between:

• The E-step, which is a computation of a conditional expectation given the observed $y$ under the current estimate of $\theta$.
• The M-step, which is a maximization.

The EM-algorithm makes most sense if the two steps above can be implemented in a computationally efficient way, e.g. when we have closed form expressions for the conditional expectation and the maximization.

## Expectation-Maximization : Jensen’s Inequality, Algorithm and Equations

We maximize the expected value of this log-likelihood. The justification for doing this comes from Jensen's inequality.

### Jensen’s Inequality

Jensen’s inequality is a general result in convexity. Let $f$ be a function whose domain is the set of real numbers. Recall that $f$ is a convex function if $f''(x)\ge 0$ for all $x \in \mathbb{R}$.

If $f''(x) > 0$ for all $x$, then we say $f$ is strictly convex. Jensen’s inequality can then be stated as follows:

Theorem: Let $f$ be a convex function, and let $X$ be a random variable. Then

$E[f(X)] \ge f(E[X])$

Moreover, if $f$ is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e. if $X$ is a constant).

EM Algorithm

Step One: Find a lower-bound of $l(\theta; y)$

First we introduce a density function q(x), called the averaging distribution.
A lower bound on the log-likelihood is given by:

$l(\theta; Y) = \log p(y|\theta)$
$= \log\left(\sum\limits_{x}p(y,x|\theta)\right)$
$= \log\left(\sum\limits_{x}q(x)\frac{p(y,x|\theta)}{q(x)}\right)$
$\ge \sum\limits_{x}q(x)\log\frac{p(y,x|\theta)}{q(x)}$
$= E_{q(x)}[\log p(y,x|\theta)] + Entropy[q(x)]$
$= L(q, \theta; y)$

The $\ge$ follows from Jensen's inequality (log-concavity). More explicitly we can decompose $l(\theta; y)$ as the sum of three terms:

$l(\theta; y) = E_{q(x)}[\log p(y,x|\theta)] + KL[q(x) \| p(x|y,\theta)] + Entropy[q(x)]$
(KL denotes the Kullback-Leibler divergence discussed earlier.)
The expectation term $E_{q(x)}[\log p(y,x|\theta)]$ is called the expected complete log-likelihood (or Q-function). The equation says that the sum of the Q-function and the entropy of the averaging distribution provides a lower bound on the log-likelihood.

Step Two: Maximize the bound over $\theta$ and q(x) iteratively.

Look at the bound $L(q, \theta; y)$. The equality is reached only at $q(x) = p(x|y, \theta)$, and the entropy term is independent of $\theta$. So we have:

E-step: $q^t = {argmax}_q \; L(q, \theta^{t-1};y) = p(x|y,\theta^{t-1})$

M-step: $\theta^t = {argmax}_{\theta} \; L(q^t, \theta;y) = {argmax}_{\theta} \; E_{q^t(x)}[\log p(y,x|\theta)]$

# HMM algorithm covered so far

The inference problems that we often involve:

• probability of an observed sequence $P(Y_1,…,Y_T)$
• filtering $P(X_t|Y_1,…,Y_t)$
• smoothing $P(X_t|Y_1,…,Y_T)$
• best explanation ${argmax}_{X_1,…,X_T} P(X_1,…,X_T,Y_1,…,Y_T)$
• model selection

Caveats:

• EM doesn’t guarantee the asymptotic properties of MLE — consistency and asymptotic normality.
• Event rates and activity durations are often function of time
• Events could happen between observations
• For an HMM, if we also observe the latent variables, then we can compute the maximum likelihood estimate of the parameters directly. If we do not, all we can compute is a lower bound on the likelihood; the true parameter values can only do better, so we get a conservative estimate of the performance of our estimator.

# Discrete-time Gaussian process

From Wikipedia: In a Gaussian process, every point in some input space is associated with a normally distributed random variable. Moreover, every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions.

A Gaussian process is fully specified by its mean function $m(x)$ and covariance function $k(x,x')$. This is a natural generalization of the Gaussian distribution, whose mean and covariance are a vector and a matrix, respectively. The Gaussian distribution is over vectors, whereas the Gaussian process is over functions. We can write:

$f \sim GP(m, k)$,

meaning: “the function f is distributed as a GP with mean function m and covariance function k”.

The individual random variables in a vector from a Gaussian distribution are indexed by their position in the vector. For the Gaussian process it is the argument x (of the random function f(x)) which plays the role of index set: for every input x there is an associated random variable f(x), which is the value of the (stochastic) function f at that location.

Latent state: $x_t$

Observed state: $z_t$

The point is to estimate the latent state $x_t$ for $t=1,2,\dots$ from the observations $z_t$,

given the following:

• $x_t = F_t x_{t-1} + B_t u_t + w_t$, where $w_t\sim {\cal N}(\vec 0, Q_t)$
• $z_t = H_t x_t + v_t$, where $v_t\sim {\cal N}(\vec 0, R_t)$

Procedure to follow:

• Iterate over $t = 1,2,\dots$
• Calculate the probability distribution of $X_t$ from the probability distribution of $X_{t-1}$
• Calculate the joint distribution of $(X_t, Z_t)$ and update $X_t$ to its conditional distribution given $Z_t=z_t$
• All distributions involved are multivariate normal, whose density is:
• $p(x;\mu,\Sigma) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

# Multivariate Normal Distribution- Linear Transformation

Suppose x is a random variable following a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. Now a linear transformation is applied to x to create a new random variable $y=Ax+b$. We find the mean of $y$ by using the fact that $E$ is a linear operator:
$\bar{y}=E(y)=E(Ax+b)=AE(x)+b=A\bar{x}+b=A\mu +b$
Then we find covariance:

$\mbox{Cov}(y)=E\{(y-\bar{y})(y-\bar{y})^T\}$
$=E\{[(Ax+b)-(A\bar{x}+b)][(Ax+b)-(A\bar{x}+b)]^T\}$
$=E\{[A(x-\bar{x})][A(x-\bar{x})]^T\}$
$=E\{A(x-\bar{x})(x-\bar{x})^TA^T\}$
$=AE\{(x-\bar{x})(x-\bar{x})^T\}A^T$
$=A \Sigma A^T$

(The linear transformation first multiplies x by A, which involves rotation and scaling; then it adds the constant b, which involves shifting. That is, $Ax+b$ first distorts x by stretching and rotation and then offsets it by b.)

Thus we see that the resulting random variable is also Gaussian, with mean $A\mu +b$ and covariance $A\Sigma A^T$.
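A tiny pure-Python check of the two moment formulas $\bar y = A\mu+b$ and $\mathrm{Cov}(y)=A\Sigma A^T$ (the matrices below are arbitrary example numbers):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

A     = [[2.0, 0.0], [1.0, 1.0]]
b     = [1.0, -1.0]
mu    = [0.5, 2.0]
Sigma = [[1.0, 0.3], [0.3, 2.0]]

mean_y = [sum(A[i][k] * mu[k] for k in range(2)) + b[i] for i in range(2)]
cov_y  = matmul(matmul(A, Sigma), transpose(A))
print(mean_y)  # [2.0, 1.5]
print(cov_y)   # A Sigma A^T, symmetric as a covariance must be
```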

# Multivariate Normal Distribution – Conditional Probability

Given
$\left( \begin{array}{c} X_1 \\ X_2 \end{array} \right) \sim{\cal N}\left(\left(\begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right),\left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right)\right)$

The conditional distribution of $X_1$ given $X_2$ is:

$X_1 | X_2 = x_2\sim{\cal N}(\mu_1 + \Sigma_{12} \Sigma_{22}^{-1} ( x_2 - \mu_2), \Sigma_{11}-\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})$

Proof:

First we will take a look at the Schur complement, which will be essential to find the conditional distribution of $X_1$ given $X_2$.

## Schur complement

We need to find inverse of the covariance matrix $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)$.

In order to do that, we first use the 2nd column $\left(\matrix{\Sigma_{12}\cr\Sigma_{22}}\right)$ to Gaussian-eliminate $\Sigma_{21}$, which is equivalent to right-multiplying the covariance matrix by $\left(\matrix{I & 0 \cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$. We then use the resulting 2nd row $\left(\matrix{0 & \Sigma_{22}}\right)$ to eliminate $\Sigma_{12}$, which is equivalent to left-multiplying by $\left(\matrix{ I & -\Sigma_{12}\Sigma_{22}^{-1} \cr 0 & I}\right)$. We perform the following operation:

$\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right) \left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$
$= \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right) \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12}\cr 0 & \Sigma_{22}}\right)$
$= \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right)$

Writing the equation above as $A\Sigma B = D$, with $A=\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$, $B=\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$, and $D$ the block-diagonal result, we have $\Sigma = A^{-1} D B^{-1}$ and therefore $\Sigma^{-1} = B D^{-1} A$, which gives the following:
$\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)^{-1}$
$= \left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right) \left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right) \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$

## Application of Schur complement to find the conditional pdf of a multivariate normal distribution

Let $x = \left(\matrix{X_1\cr X_2}\right)\sim \mathcal{N}\left(\left(\matrix{\mu_1\cr\mu_2}\right), \left(\matrix{\Sigma_{11} &\Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right) \right)$ be multivariate normal distribution, where $X_1$ and $X_2$ are vectors of dimension $p$ and $q$ respectively, and we want to find the conditional distribution of $X_1$ on $X_2=x_2$.
The probability density function of $X$ is
$f\left(\matrix{x_{1}\cr x_{2}}\right) \propto \exp\left(-\frac{1}{2} \left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right) \left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1} \left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)$

Then:

$\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1}$

$=\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right) \left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right) \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$
Substituting this factorization into the exponent, the terms involving $x_{1}$ combine into $-\frac{1}{2}\left(x_{1}-\mu_{1}-\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right)\right)^{T}\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1}\left(x_{1}-\mu_{1}-\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right)\right)$, while the remaining terms depend only on $x_{2}$ and are absorbed into the normalizing constant.

Hence $X_{1}|X_{2}=x_{2}\sim\mathcal{N}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$
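The conditional mean and covariance formulas translate directly into code. A minimal sketch with numpy; all the numbers below are made-up illustrative values, not from the text:

```python
import numpy as np

# Conditional distribution X1 | X2 = x2 for a partitioned Gaussian.
mu1, mu2 = np.array([0.0, 1.0]), np.array([2.0])
S11 = np.array([[2.0, 0.3], [0.3, 1.0]])   # Sigma_11
S12 = np.array([[0.5], [0.2]])             # Sigma_12
S22 = np.array([[1.5]])                    # Sigma_22

x2 = np.array([3.0])                       # observed value of X2
S22_inv = np.linalg.inv(S22)

# mu_1 + Sigma_12 Sigma_22^{-1} (x_2 - mu_2)
cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)
# Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21  (the Schur complement)
cond_cov = S11 - S12 @ S22_inv @ S12.T
```

Note that the conditional covariance does not depend on the observed value $x_2$, only the conditional mean does.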

# Kalman Filter

A Kalman filter is a recursive estimator, a weighted average of the previous estimate (propagated to the present) and the new information, where the weights are optimized to minimize the squared error.

In layman’s terms: we had many measurements in the past that are all packed neatly into the previous estimate. Now we have a new measurement. The output of the Kalman filter is a value somewhere in the middle, based on the validity of the previous estimate (how much data, its quality, and how dynamic the model is) and the accuracy of the new information.

## The process

The Kalman filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements. As such, the equations for the Kalman filter fall into two groups: time update equations and measurement update equations. The time update equations are responsible for projecting forward (in time) the current state and error covariance estimates to obtain the a priori estimates for the next time step. The measurement update equations are responsible for the feedback, i.e., for incorporating a new measurement into the a priori estimate to obtain an improved a posteriori estimate.

The time update equations can also be thought of as predictor equations, while the measurement update equations can be thought of as corrector equations. Indeed the final estimation algorithm resembles that of a predictor-corrector algorithm for solving numerical problems.

The Kalman filter model assumes that the state of a system at a time t evolved from the prior state at a time t-1 according to the equation:

$X_t = F_t X_{t-â1} + B_t u_t + w_t$

This means that each state $x_t$ (our signal values) is given by a linear stochastic equation: a linear combination of its previous value, a control signal $u_t$, and a process noise term (which may be hard to conceptualize).

• $X_t$ is the state vector containing the terms of interest for the system (e.g., position, velocity, heading) at time t
• $u_t$ is the vector containing any control inputs (steering angle, throttle setting, braking force)
• $F_t$ is the state transition matrix which applies the effect of each system state parameter at time t-1 on the system state at time t (e.g., the position and velocity at time t-1 both affect the position at time t)
• $B_t$ is the control input matrix which applies the effect of each control input parameter in the vector $u_t$ on the state vector (e.g., applies the effect of the throttle setting on the system velocity and position)
• $w_t$ is the vector containing the process noise terms for each parameter in the state vector. The process noise is assumed to be drawn from a zero mean multivariate normal distribution with covariance given by the covariance matrix $Q_t$.
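The state equation above can be simulated directly. A minimal sketch with numpy, assuming a 1-D constant-velocity model (state = position and velocity, control = acceleration); the time step `dt`, the noise covariance `Q`, and the constant control input are illustrative choices:

```python
import numpy as np

# Simulate x_t = F x_{t-1} + B u_t + w_t for a constant-velocity model.
rng = np.random.default_rng(1)
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # position advances by velocity * dt
B = np.array([[0.5 * dt**2], [dt]])     # effect of acceleration on pos/vel
Q = 0.01 * np.eye(2)                    # process noise covariance

x = np.array([0.0, 1.0])                # initial position and velocity
trajectory = []
for t in range(50):
    u = np.array([0.2])                 # constant throttle/acceleration input
    w = rng.multivariate_normal(np.zeros(2), Q)
    x = F @ x + B @ u + w
    trajectory.append(x.copy())
```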

We are basically incorporating the control action, in addition to the data, to determine the state estimate. This allows us to model a system in motion.

If we are pretty sure that the system under consideration fits the above model, then the next step is to determine the necessary parameters and the initial estimates. As already mentioned, there are two distinct sets of equations: the time update (prediction) and the measurement update (correction).

For the prediction, we find the distribution of $X_t=F_t X_{t-1}+B_t u_t+w_t$, where $w_t\sim {\cal N}(0,Q_t)$, given that
$X_{t-1}\sim{\cal N}(\hat x_{t-1|t-1}, P_{t-1|t-1})$. Since a linear transformation of a normal random vector is again normal,

$X_{t}\sim {\cal N}(F_t \hat x_{t-1|t-1}+ B_t u_t, F_t P_{t-1|t-1}F^{\rm T}_t+Q_t)$

The standard Kalman filter equations for the prediction stage are:

Predicted (a priori) state: $\hat{x}_{t|t-1}=F_t\hat{x}_{t-1|t-1}+B_t u_t$

Predicted (a priori) covariance: $P_{t|t-1}=F_tP_{t-1|t-1}F_t^T+Q_t$

where $Q_t$ is the covariance matrix of the process noise $w_t$.
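The prediction stage is a direct transcription of the two equations above. A minimal sketch with numpy, using the same symbols ($F_t$, $B_t$, $u_t$, $Q_t$) as in the text:

```python
import numpy as np

def predict(x_est, P_est, F, B, u, Q):
    """One Kalman time-update (prediction) step.

    Returns the a priori state estimate x_{t|t-1} and covariance P_{t|t-1}
    from the previous a posteriori estimate (x_est, P_est).
    """
    x_pred = F @ x_est + B @ u          # x_{t|t-1} = F x_{t-1|t-1} + B u_t
    P_pred = F @ P_est @ F.T + Q        # P_{t|t-1} = F P_{t-1|t-1} F^T + Q
    return x_pred, P_pred
```

Note that the predicted covariance always grows by $Q_t$: uncertainty accumulates until a measurement corrects it.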

To start the process, we need initial estimates $\hat x_0$ and $P_0$. For the measurement update equations, we find the posterior distribution of $X_t$, given the prior $X_{t}\sim {\cal N}(\hat x_{t|t-1}, P_{t|t-1})$ and an observation $Z_{t} = H_t X_t + v_t=z_t$, where $v_t\sim {\cal N}(0,R_t)$:

Innovation: $\epsilon_t = z_t-H_t\hat x_{t|t-1}$
Innovation covariance: $S_t = H_t P_{t|t-1} H_t^{\rm T} + R_t$
Optimal Kalman gain: $K_t = P_{t|t-1}H_t^{\rm T}S_t^{-1}$
Updated (a posteriori) state estimate: $\hat x_{t|t} = \hat x_{t|t-1} + K_t\epsilon_t$
Updated (a posteriori) covariance: $P_{t|t} = P_{t|t-1}-K_t H_t P_{t|t-1}$
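The measurement update equations can be sketched the same way. A minimal sketch with numpy, following the symbols above ($H_t$, $R_t$, $z_t$); the scalar example in the comments uses made-up toy values:

```python
import numpy as np

def update(x_pred, P_pred, z, H, R):
    """One Kalman measurement-update (correction) step.

    Combines the a priori estimate (x_pred, P_pred) with a measurement z
    and returns the a posteriori estimate (x_post, P_post).
    """
    eps = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # optimal Kalman gain
    x_post = x_pred + K @ eps                 # corrected state estimate
    P_post = P_pred - K @ H @ P_pred          # corrected covariance
    return x_post, P_post

# Toy scalar example: equal prior and measurement variances, so the
# posterior lands halfway between prediction (0) and measurement (1).
x_post, P_post = update(np.array([0.0]), np.array([[1.0]]),
                        np.array([1.0]), np.eye(1), np.array([[1.0]]))
```

In the scalar case the gain reduces to $K = P/(P+R)$: the filter trusts the measurement more when the prior covariance $P$ is large relative to the measurement noise $R$, matching the weighted-average intuition given at the start of this section.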