# Gibbs Sampling and Metropolis-Hastings Sampling [10/06]

Prepared by Aayush Uppal, Madhur Gupta and Nishchita Purushothama Patel

### Gibbs Sampling

#### Introduction to Monte Carlo Sampling

Gibbs Sampling is a Sampling technique, So the first fundamental question is, Why Sampling ?

Gibbs Sampling allows us to sample from a distribution that asymptotically follows P(π|X) (Prior information) without having to explicitly calculate the integrals. Gibbs Sampling is an instance of a Markov Chain Monte Carlo technique. Monte Carlo methods are algorithms that help you obtain a desired value by performing simulations involving probabilistic choices.

Understanding Monte Carlo technique with a simple example to estimate value of $\pi$

Draw a perfect square on the ground. Inscribe a circle in it i.e. the circle and the square are centered in exactly the same place, and the circle’s diameter has length identical to the side of the square. Now take a bag of rice, and scatter the grains uniformly at random inside the square. Finally, count the total number of grains of rice inside the circle (call that C), and inside the square (call that S). Therefore we can say that: $\frac{C}{S} \approx \frac{\pi(\frac{d}{2})^2}{d^2}$. Solving for $\pi$, we get $\pi \approx \frac{4C}{S}$

We just solved a problem by approximating the values of integrals. The more grains of rice we use, the better our approximation will be.

Understanding the Markov Chains aspect

In the above example the distribution was uniform, now we will assume that the distribution is not uniform, also we are interested in $E_{p(x)} [f(x)]$ The expected value is represented as: $E[f(z)] = \int f(z)p(z)dz$.

Conceptually, the integral in equation sums up f(z)p(z) over infinitely many values of z. But rather than touching each point in the sum exactly once, if you sample N points z(0) , z(1) , z(2) , . . . , z(N) at random from the probability density p(z), then.

$E_{p(z)}[f(z)] = \lim_{N\to\infty} \frac{1}{N}\sum \limits_{n=1}^{N} f(z^{(t)})$

Looking at this equation, it’s clear that we can get an approximate value by sampling only a finite number of times, T:

$E_{p(z)}[f(z)] \approx \frac{1}{T}\sum \limits_{t=1}^{T} f(z^{(t)})$

So, now we have a way to approximate the integral. The remaining problem is this: how do we sample $z^{(0)}, z^{(1)}, z^{(2)}, \cdots , z^{(T)}$ according to $p(z)$. For our purposes, though, the key idea is to think of z’s as points in a state space, and find ways to “walk around” the space — going from $z^{(0)}$ to $z^{(1)}$ to $z^{(2)}$ and so forth — so that the likelihood of visiting any point z is proportional to p(z). Here g is a function that makes a probabilistic choice about what state to go to next according to an explicit or implicit transition probability $P_{trans}(z^{(t +1)} | z^{(0)}, z^{(1)}, \cdots , z^{(t )})$

Now what will make it a Markov Chain Monte Carlo technique is defining things so that the next state you visit, $z^{(t+1)}$, depends only on the current state $z^{(t)}$ . i.e:

$P_{trans}(z^{(t+1)}|z^{(0)}, z^{(1)},\cdots,z^{(t)}) = P_{trans}(z^{(t+1)}|z^{(t)})$

The heart of Markov Chain Monte Carlo methods is designing g s o that the probability of visiting a state z will turn out to b e p(z), as desired. Gibbs sampling is one algorithm that meets the conditions.

#### The Gibbs sampling algorithm

Gibbs sampling is applicable in situations where Z has at least two dimensions, i.e. each point z is really z = <z1 , . . . , zk>, with k > 1. The basic idea in Gibbs sampling is that, rather than probabilistically picking the next state all at once, you make a separate probabilistic choice for each of the k dimensions, where each choice depends on the other k-1 dimensions.
$z^{(0)} := <z_1^{(0)},…,z_k^{(0)}>$
for t = 1 to T do
for i = 1 to k do
$z_i^{(t+1)} \approx P( Z_i | z_1^{(t+1)},..,z_{i-1}^{(t+1)} ,z_{i+1}^{(t)}….,,z_{k}^{(t)})$
end for
end for

The distribution we are sampling from by using the definition of conditional probability :

$P( Z_i | z_1^{(t+1)},..,z_{i-1}^{(t+1)} ,z_{i+1}^{(t)}….,,z_{k}^{(t)}) = \frac{P(z_1^{(t+1)},..,z_{i-1}^{(t+1)} ,z_{i}^{(t)},z_{i+1}^{(t)}….,,z_{k}^{(t)})}{P(z_1^{(t+1)},..,z_{i-1}^{(t+1)},z_{i+1}^{(t)}….,,z_{k}^{(t)})}$

#### Gibbs Sampler Stationary Distribution

The Gibbs sampler has been used extensively in the statistics literature. It relies on iteratively sampling from a set compatible conditional distributions and the sampler is known to converge to a Stationary distribution. The Markov Chain generated by Gibbs Sampler converges to the target density as the number of iterations become large. P(X) is the stationary distribution of the Gibbs sampler, i.e. $P(\phi) = \int P(\phi| X)P(X)dX = \int P(X, \phi) dX$. The reason is the following.

\begin{align}
\int P(\phi| X)P(X)dX = & \int d X_1 d X_2 P(\phi_1|X_2) P(\phi_2|\phi_1)P(X_1,X_2) \\
= & P(\phi_2|\phi_1) \int d X_2 P(\phi_1|X_2) \int d X_1 P(X_1,X_2) \\
= & P(\phi_2|\phi_1) \int d X_2 P(\phi_1|X_2) P(X_2) \\
= & P(\phi_2|\phi_1) P(\phi_1) \\
= & p(\phi) \\
\end{align}

#### Detailed Balance

(Principle of detailed balance) Transition probability matrix P and probability distribution $\pi$ are said to be in detailed balance, or equivalently, the principle of detailed balance is said to hold, if: $\pi_i p_{ij} = pi_j p_{ji} \forall_{i,j} \epsilon S$

Ultimately if the detailed balance equations are satisfied, it implies the markov chain is reversible and it has an invariant distribution. Gibbs sampler has detailed balance if it

scans through components in fixed order, then
scans through components in reverse order

\begin{align} P(X)P(\phi|X) = & P(X) \int d X’_1 P(X’_1 |X_2) P(\phi_2|X’_1) P(\phi_1|\phi_2)\\ = & P(X) P(\phi_1|\phi_2) \int d X’_1 P(X’_1 |X_2) P(\phi_2|X’_1)\\ = & P(X_1|X_2)P(\phi_1|\phi_2) \int d X’_1 P(X’_1) P(X_2|X’_1) P(\phi_2|X’_1)\\ = & P(\phi)P(X|\phi) {\rm~by~symmetry} \end{align}

Gibbs Algorithm is a special case of Metropolis-Hastings Algorithm,i.e Metropolis is a generalization for sampling . Metropolis sampling algorithm can draw samples from a distribution of complex random Variable $X$. $X$ could be a stochastic process or e.g. a process of formation of random network model or whole social diffusion process.

### Metropolis-Hastings Sampling

In order to work with Metropolis-Hastings algorithm our only requirement is that we should be able to write down the $P(X)$ up to some scaling factor.

The Metropolis algorithm first proposes a possible new state $X^{*}$ in the Markov chain, based on a previous state $X^{(t-1)}$, according to the proposal distribution $q(X^{*} | X^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the the target distribution $P(X)$ evaluated at $X^{*}$ .

One constraint of the Metropolis sampler is that the proposal distribution $q(X^{*} | X^{(t-1)})$ must be symmetric. The constraint originates from using a Markov Chain to draw samples: a necessary condition for drawing from a Markov chain’s stationary distribution is that at any given point in time t, the probability of moving from $X^{(t-1)}$ $\rightarrow$ $X^{(t)}$ must be equal to the probability of moving from $X^{(t-1)}$ $\rightarrow$ $X^{(t)}$ , a condition known as reversibility or detailed balance.

Metropolis-Hastings algorithm implements an additional correction factor \textit{c}, defined from the proposal distribution as
$$c = \frac{q(X^{t-1} | X^{(*)})}{q(X^{*} | X^{(t-1)})}$$

The correction factor adjusts the transition operator to ensure that probability of moving from $q(X^{*} | X^{(t-1)})$ is equal to the probability of moving from $q(X^{*} | X^{(t-1)})$, no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of acceptance probability $\alpha$. Specifically, to draw $M$ samples using the Metropolis-Hastings sampler:

1. set $t$ = 0, generate an initial state $X^{(0)}$ $\sim$ $\pi^{(0)}$
2. repeat until $t$ = $M$
3. set $t$ = $t$+1
4. generate a proposal state $X^{*}$ from $q(X | X^{(t-1)})$
5. calculate the proposal correction factor $c = \frac{q(X^{t-1} | X^{(*)})}{q(X^{*} | X^{(t-1)})}$
6. calculate the acceptance probability $\alpha = \min \left(1,\frac{P(X^{*})}{P(X^{(t-1})}\times c \right )$
7. draw a random number $u$ from $Unif(0,1)$
1. if $u \leq \alpha$ accept the proposal state $X^{*}$ and set $X^{(t)} = X^{*}$
2. else set $X^{(t)} = X^{(t-1)}$

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.

#### Metropolis-Hastings Sampling — Example

mh.gamma <- function(n.sims, start, burnin, cand.sd, shape, rate) {
draws <- start
theta.update <- function(theta.cur, shape, rate) {
theta.can <- rnorm(1, mean = theta.cur, sd = cand.sd)
accept.prob <- dgamma(theta.can, shape = shape, rate = rate)/dgamma(theta.cur, shape = shape, rate = rate)
if (runif(1) <= accept.prob) theta.can else theta.cur
}
for (i in 2:n.sims) draws = c(draws, theta.update(draws[i-1], shape = shape, rate = rate))
return(draws[(burnin + 1):n.sims])
}
mh.draws <- mh.gamma(10000, start = 1, burnin = 1000, cand.sd = 2, shape = 2, rate = 4)
plot(density(mh.draws))
lines(seq(0,2,length.out = 100), dgamma(seq(0,2,length.out = 100), shape=2, rate=4), col='red')
legend('topright',lty=1, col=c('black','red'), legend=c('M-H','gamma(shape=2,rate=4)'))


#### Detailed Balance of Metropolis-Hastings algorithm

$P(\phi|X) = q(X,\phi)\alpha(X, \phi)$, if $X\ne\phi$

$P(X|X) = 1-\int d\phi q(X,\phi)\alpha(X,\phi)$

Hence
\begin{align}
& P({Y\le\phi}|X) = \int_{-\infty}^\phi dY q(X,Y)\alpha(X,Y) + 1(\phi\ge X)\left(1-\int dY q(X,Y)\alpha(X,Y)\right)\\
\Rightarrow & p(X, \phi) = {d P({Y\le\phi}|X) \over d\phi} = q(X,\phi)\alpha(X,\phi) + \delta(X-\phi)\left(1- \int dY q(X,Y)\alpha(X,Y)\right)\\
\Rightarrow & P(X)p(X,\phi) = P(X) q(X,\phi) \min \left(1,{P(\phi) q(\phi, X) \over P(X) q(X, \phi)} \right) \\
& +\delta(X-\phi)\left(P(X)-\int dY P(X) q(X,Y) \min \left(1,{P(Y) q(Y, X) \over P(X) q(X, Y)} \right) \right)\\
\Rightarrow & P(X)p(X,\phi) = \min \left(P(X) q(X,\phi),P(\phi) q(\phi, X) \right)
+\delta(X-\phi)\left(P(X)-\int\min dY \left(P(X) q(X,Y), P(Y) q(Y, X) \right) \right)\\
& =P(\phi)p(\phi, X) {\rm~by~symmetry}
\end{align}

### Gibbs Sampling from Hidden Markov Models

A hidden markov model is defined in terms of the following 3 latent states: $X(t)\in \{1,\cdots,k\}$, State transition kernel $p_{ij} = P(X_{t+1}=j|X_t=i)$ and the initial state $X_0$, and Observation model $P(Y_t|X_t)$

One simple way to sample from the hidden Markov model is to iterate through every sample $X_t$ for $t = 1,\cdots,T$. Here we fix everything else and write down the probability of $X_t$ as given below:

\begin{align*} & P(X_t | X_1,\cdots,X_{t-1},X_{t+1},\cdots,X_T,Y_1,\cdots,Y_T)\\ =& (P(X_1,\cdots,X_T,Y_1,\cdots,Y_T))⁄(\sum_{X_t} P(X_1,\cdots,X_T,Y_1,\cdots,Y_T))\\ \end{align*}

However in order to make an efficient sampling algorithm it has to satisfy the following 2 conditions:

• Sample points should diffuse quickly inside the sample space.
• The equilibrium distribution of the sample should be very similar to the distribution that is desired.

However the problem with the above sampling algorithms is that the points will diffuse slowly. This is because we fix the previous values and the next values at every state. This means although the above algorithm works, it takes infinite amount of time for it to ultimately mix well to be able to explore the whole probability space. To avoid this, we use the following algorithm:

Forward Filtering Backward Sampling Algorithm

1. We first conduct the forward sweep and extract the forward statistic $\alpha_t(X_t)= P(X_t|Y_1,\cdots,Y_t)$ for $t=1,2,\cdots$
2. Next step is the backward sampling stage. Here, we sample $X_T$ first and so on until $X_0$

\begin{align*} & P(X_{t-1}│X_t,Y_1,\cdots,Y_T)\\ =& \frac{P(X_{t-1},X_t,Y_1,\cdots,Y_T)}{P(X_t,Y_1,\cdots,Y_T)}\\ =& \frac{P(X_{t-1},Y_1,\cdots,Y_{t-1})P(X_t|X_{t-1})P(Y_t,\cdots,Y_T|X_t)}{\sum_{X_{t-1}} P(X_{t-1},Y_1,\cdots,Y_{t-1})P(X_t|X_{t-1})P(Y_t,\cdots,Y_T|X_t)}\\ =& \frac{\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})P(Y_t|X_t)\beta_t(x_t)}{\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})P(Y_t|X_t)\beta_t(x_t)} \end{align*}