# Expectation Propagation [10/01]

Prepared by Ashish Gupta, Utsav Patel and Santhosh Somarapu.
Wen Dong will add the math of the lecture into this note or will turn the math into a problem in problem set 3/4.

## Belief Propagation

Belief propagation, also known as sum-product message passing, is a message-passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution of each unobserved node, conditioned on any observed nodes.

Suppose we have the graphical model shown above. We can convert this graphical model into a cluster graph (also called a component graph or clique tree), shown below, by removing the directed edges; edge directions do not matter when we apply the forward-backward algorithm for inference. The different components are shown in the diagram below.

Cluster labels: A, B, C, D, in order.

A more simplified version of the component graph is shown below. Each cluster has jurisdiction over some variables: cluster A has jurisdiction over variable x1, cluster B over variables x1 and x2, cluster C over variables x2 and x3, and so on. These components talk to each other and try to convince one another about the state of the variables they hold in common. So cluster A talks to cluster B about the state of variable x1, making cluster B better informed about x1 so that it can form a better distribution over its own variables.

All these clusters communicate via message passing. Each cluster multiplies its incoming messages by its current belief and sends the result forward to the next cluster as a message. This continues until convergence, that is, until the variable distributions stop changing appreciably.
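The scheme above can be sketched for a small chain. Below is a minimal illustration assuming three binary variables x1, x2, x3 on a chain; the unary potential `phi1` and pairwise potentials `psi12`, `psi23` are made-up numbers, not from the lecture:

```python
import numpy as np

# Sum-product message passing on the chain x1 - x2 - x3.
phi1 = np.array([0.6, 0.4])                 # cluster A's belief over x1
psi12 = np.array([[0.9, 0.1], [0.2, 0.8]])  # potential between x1 and x2
psi23 = np.array([[0.7, 0.3], [0.4, 0.6]])  # potential between x2 and x3

# Forward messages: each cluster multiplies the incoming message into its
# potential and sums out the variable the next cluster does not share.
msg_A_to_B = phi1               # message about x1
msg_B_to_C = msg_A_to_B @ psi12 # sum over x1 -> message about x2

# Backward message (no evidence at the right end, so start from ones).
msg_C_to_B = psi23 @ np.ones(2) # sum over x3 -> message about x2

# The belief over x2 is the product of all incoming messages, normalized.
belief_x2 = msg_B_to_C * msg_C_to_B
belief_x2 /= belief_x2.sum()
print(belief_x2)
```

On a tree-structured graph, one forward and one backward sweep already give the exact marginals; on graphs with loops, the same updates are iterated until convergence.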

## Approximate Belief Propagation

Let alpha denote the forward messages and beta the backward messages, and assume both belong to the exponential family. Given the sufficient statistics, i.e. the expected values of the features under the distribution, we can recover the parameters of alpha; given the parameters, we can compute the expected values of the features. There is a nice duality between the set of features and the parameters in the exponential family.

An exponential family is specified by a set of features. Assuming an exponential-family form for alpha(t-1), we can propagate the information forward to alpha(t), estimate the features of the resulting distribution, and, because of the duality between features and parameters in the exponential family, recover the parameters. Similarly, we can propagate from beta(t) back to beta(t-1) and estimate its features and parameters.
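As a concrete illustration of this feature/parameter duality, the sketch below takes a propagated message that is no longer Gaussian (here a made-up two-component Gaussian mixture) and projects it back onto a single Gaussian by matching the expected features E[x] and E[x^2]; all the mixture numbers are illustrative assumptions:

```python
import numpy as np

# Pretend the propagated message is a two-component Gaussian mixture.
weights = np.array([0.3, 0.7])
means = np.array([-1.0, 2.0])
stds = np.array([0.5, 1.0])

# Expected features E[x] and E[x^2] under the mixture (closed form).
m1 = np.sum(weights * means)
m2 = np.sum(weights * (stds**2 + means**2))

# Duality: the matched Gaussian's parameters are recovered from the moments.
mu = m1
var = m2 - m1**2
print(mu, var)
```

For the Gaussian, the features are x and x^2, so moment matching and parameter recovery are one step; for other exponential families the same matching holds but the parameter map is less trivial.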

## Stochastic Kinetic Models

Stochastic kinetic models (SKMs) are highly multivariate jump processes used to model chemical reaction networks, particularly in biochemical and cellular systems.

An example of such a system: a mixture of molecules of different chemical species (reactants) that interact and produce different chemical products. Many possible products can be generated by all the processes, and we want to know the possible composition of the mixture after some time.

- Suppose we have a combination of M chemical species interacting in a system. We can specify the initial composition (at time t = 0) as
$x_0 = \{x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(M)}\}$,
where $x_0^{(m)}$ is the initial quantity of the $m^{th}$ chemical.

- At time t, the composition is $x_t = \{x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(M)}\}$.

- Among these chemicals, many possible reactions can occur and change the state of the chemical composition. E.g.

$A + C + B \rightarrow D + E$
$C+ E\rightarrow B+A$
$A + E \rightarrow F$

- We assume there are V reactions in total, and each reaction's rate depends on the quantities of its reactant species at time t ($x_t$) and a rate constant $c_k$.

If reaction k has a single reactant species i, the rate is $h_k(x_t, c_k) \propto c_k \, x_t^{(i)}$.

If there is more than one reactant, the rate is $h_k(x_t, c_k) \propto c_k \prod_{i} x_t^{(i)}$, where the product runs over the reactant species of reaction k.

- Each reaction k changes the composition by $\Delta_k$. If reaction k converts $\alpha^{(1)}_k X^{(1)} + \ldots + \alpha^{(M)}_k X^{(M)} \rightarrow \beta^{(1)}_k X^{(1)} + \ldots + \beta^{(M)}_k X^{(M)}$,
- then $\Delta_k = (\beta^{(1)}_k - \alpha^{(1)}_k, \ldots, \beta^{(M)}_k - \alpha^{(M)}_k)$.
- We want to estimate the state $x_t$ of the chemicals probabilistically using stochastic simulation.
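To make the stoichiometry concrete, here is a tiny sketch computing $\Delta_k$ for the example reaction $A + C + B \rightarrow D + E$, under the assumption that the species are ordered (A, B, C, D, E, F):

```python
# State-change vector Delta_k = beta_k - alpha_k for A + C + B -> D + E,
# with species ordered (A, B, C, D, E, F).
alpha_k = [1, 1, 1, 0, 0, 0]   # reactant coefficients
beta_k  = [0, 0, 0, 1, 1, 0]   # product coefficients
delta_k = [b - a for a, b in zip(alpha_k, beta_k)]
print(delta_k)   # applied to the state as x <- x + delta_k
```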

## Gillespie’s Algorithm Implementation
- Given the initial composition $x_0 = \{x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(M)}\}$, evaluate the propensity function (rate) $h_k(x_t, c_k)$ for every reaction k that can occur at that time.
- From all possible reactions, choose one at random with probability proportional to its rate ($\frac{h_k}{h_0}$, where $h_0 = \sum_k h_k$), and sample the time $\tau$ until the next reaction occurs.
- Update the time and the available composition of the mixture according to the consumed reactants and produced products: $t \leftarrow t + \tau$, $x \leftarrow x + \Delta_k$.
- Repeat steps 2 and 3 as long as some reaction can still occur from the available composition, or until you reach the desired time at which to observe the system.
- Repeating the whole simulation several times gives additional information about how the system behaves in intermediate states and which composition states occur frequently.
- In this manner we can simulate the whole chemical chain-reaction process as a stochastic model.
- The same algorithm can be applied to many other settings, for example: given an initial population of predators and prey in an ecological system, what numbers of prey and predators will thrive or survive after some time?
- Mathematically, the probability that reaction k, with rate $h_k(x, c_k)$, is the next event and occurs after a time interval $\tau$ is given by $P(k, \tau \mid x) = h_k(x, c_k) \exp(-h_0(x, c)\,\tau)$, where $h_0(x, c) = \sum_k h_k(x, c_k)$.
- Hence the joint probability of the system over the whole time period is
$\prod_{i=0}^n \Big[ h_{v_i}(x_{t_i}, c_{v_i}) \exp\big(-h_0(x_{t_i}, c)\,(t_{i+1} - t_i)\big)\Big]$
$= \Big[\prod_{i=0}^n h_{v_i}(x_{t_i}, c_{v_i})\Big] \exp\Big(-\sum_{i=0}^n h_0(x_{t_i}, c)\,(t_{i+1} - t_i)\Big)$
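The steps above can be sketched in code. The two-reaction system, rate constants, and initial counts below are made-up illustrations (not from the lecture), with propensities of the mass-action form $c_k \prod_i x^{(i)}$:

```python
import math
import random

random.seed(0)

# Made-up system: A + B -> C (binding) and C -> A + B (unbinding).
species = {"A": 50, "B": 50, "C": 0}
# Each reaction: (rate constant c_k, reactant coefficients alpha_k, delta_k).
reactions = [
    (0.005, {"A": 1, "B": 1}, {"A": -1, "B": -1, "C": 1}),
    (0.1,   {"C": 1},         {"A": 1, "B": 1, "C": -1}),
]

t, t_end = 0.0, 10.0
while t < t_end:
    # Propensities h_k(x, c_k) at the current state.
    h = [c * math.prod(species[s] ** a for s, a in alpha.items())
         for c, alpha, _ in reactions]
    h0 = sum(h)
    if h0 == 0:
        break  # no reaction can occur any more
    # Time to the next reaction is exponential with rate h0.
    t += random.expovariate(h0)
    # Choose which reaction fires, weighted by h_k / h0.
    k = random.choices(range(len(reactions)), weights=h)[0]
    # Apply the state change x <- x + Delta_k.
    for s, d in reactions[k][2].items():
        species[s] += d

print(species)
```

Running the loop many times from the same $x_0$ gives samples from the distribution over compositions at time `t_end`.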

## Stochastic Kinetic Model — Discretization
To make inferences from observations of such a system, combining the observations with the dynamics, we have to discretize the events, much as is commonly done when solving differential equations numerically. Instead of sampling the time to the next event, we simply advance time one unit at a time; we do not sample event times.

At each step, rather than sampling the next event directly, we add a "nothing changed" possibility: we sample whether a specific event happens during the step, or whether no event happens. These are the two changes we make to discretize the system.

We have applied the same idea before: we can either flip a coin repeatedly, or directly estimate the time until the next success or failure, for example the next change in the weather. Stepping through time, nothing changes at time 1, nothing at time 2, and so on, until at some time t the weather changes. Either view describes the same process.

This gives a discrete-time, discrete-event model. "Discrete event" means that over any finite amount of time there are only finitely many events whose effect on the system exceeds some threshold. In general there are two types of dynamics: in one, infinitely many infinitesimal changes occur in any small interval; the other is the discrete-event case. If we count only the events whose effect exceeds a threshold, there will be only a finite number of them over any finite amount of time, which is always true here. Now we have a discrete-time, discrete-event model.
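A sketch of this discretized simulation, reusing the same made-up two-reaction system as before: each fixed step of size `dt` samples either one reaction (with probability $h_k\,dt$) or the event that nothing happens (probability $1 - h_0\,dt$). The step size is an assumption and must keep $h_0\,dt \le 1$:

```python
import math
import random

random.seed(1)

# Same made-up system: A + B -> C and C -> A + B.
species = {"A": 50, "B": 50, "C": 0}
reactions = [
    (0.005, {"A": 1, "B": 1}, {"A": -1, "B": -1, "C": 1}),
    (0.1,   {"C": 1},         {"A": 1, "B": 1, "C": -1}),
]

dt, steps = 0.01, 1000
for _ in range(steps):
    h = [c * math.prod(species[s] ** a for s, a in alpha.items())
         for c, alpha, _ in reactions]
    h0 = sum(h)
    # Event probabilities for this step; the last entry is "nothing happened".
    probs = [hk * dt for hk in h] + [max(0.0, 1.0 - h0 * dt)]
    k = random.choices(range(len(reactions) + 1), weights=probs)[0]
    if k < len(reactions):  # a real event fired this step
        for s, d in reactions[k][2].items():
            species[s] += d

print(species)
```

This is the discrete-time counterpart of the Gillespie simulation: no waiting times are sampled, only the per-step event indicator.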

## Projection — Moment Matching

Here we are dealing with approximate message passing. In the forward step we take the forward message alpha and propagate it through the system, that is, through the potential $\psi$; the backward step works symmetrically.

After propagating through $\psi$, we compute the sufficient statistics of the result and use them to write the new message in exponential-family form, and so on. Enumerating all combinatorial states is implausible, so instead we do the following. The system looks like this; let's write three components:

time slices t-1, t, t+1, with variables X1, X2, X3, and so on.

Once we have sampled an event, we change the state of the latent variables. To change the state of a variable, we need to know the event and the previous value of that latent variable; given both, we know exactly how the state changes. Viewed as inference, however, this requires enumerating all joint states, which is not possible. What we actually do is compute the mean effect, which is what we want to find.

For example, suppose I am currently susceptible and have been in contact with 2 random people. What is the probability that I will be infected? If I am infected, my state changes according to this event. This is how the system works, but it requires enumerating the combinatorial states.

In this system, what we do instead is ask: given that I am susceptible, what is my probability of getting infected based on my average contact with other people? This probability does not depend on the specific values of the events or on which specific person I am in contact with; it depends, for example, on the average amount of time I spend in contact with each person. In this way we decouple the relations between different events, and as a result we can make inference computationally tractable.
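The epidemic example can be sketched as follows. The neighbor marginals and the per-contact transmission probability `beta` are made-up numbers; the point is that the susceptible node uses only its neighbors' marginal infection probabilities (the mean effect) instead of enumerating all $2^3$ joint neighbor states:

```python
# Mean-field projection for a susceptible node with three contacts.
p_infected = [0.2, 0.5, 0.1]   # current marginals P(neighbor j is infected)
beta = 0.3                     # made-up per-contact transmission probability

# Exact inference would sum over 2^3 joint neighbor states.  The mean-field
# projection replaces each neighbor by its marginal:
#   P(escape) = prod_j (1 - beta * P(neighbor j infected))
p_escape = 1.0
for pj in p_infected:
    p_escape *= 1.0 - beta * pj
p_infect = 1.0 - p_escape
print(p_infect)
```

The cost is linear in the number of neighbors rather than exponential, which is what makes the message updates tractable.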

## Projection — Parameter Learning

After finding the mean, we need to estimate the parameters: the system has many noisy observations and the parameters are unknown, so we must learn them from those noisy observations. To do that, we first write down the probability, which factors as a product over time steps. After writing down the likelihood of X and Y, and then the log-likelihood, we can write down the expected log-likelihood. Of course, to write down the expected log-likelihood we need the probability distribution of the latent states; since we do not know it, we use our previous estimate of that distribution. We then take the partial derivative of the expected log-likelihood with respect to the parameters, which amounts to taking the partial derivative of $\log P$ with respect to the parameters.
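As a sketch of where this derivative leads for one rate constant $c_k$: under the likelihood above, the expected log-likelihood contributes a term of the form $n_k \log c_k - c_k G_k$, where $n_k$ is the expected number of times reaction k fired and $G_k$ is the expected accumulated reactant-product factor over time. The numbers below are made-up posterior summaries for illustration:

```python
# Expected sufficient statistics under the current latent-state posterior
# (made-up values).
n_k = 12.0      # expected number of firings of reaction k
G_k = 240.0     # expected integral of prod(reactant counts) over time

# Setting d/dc_k [ n_k * log(c_k) - c_k * G_k ] = n_k / c_k - G_k = 0
# gives the closed-form update:
c_k = n_k / G_k
print(c_k)
```

Alternating this parameter update with the moment-matched inference of the latent states is an EM-style procedure: the E-step supplies $n_k$ and $G_k$, and the M-step recomputes $c_k$.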