Discrete State Markov Processes [09/15/2015]

We will start off by discussing several discrete state Markov processes, their settings, and their applications.

1. Simplified Weather System

In this system we have latent states and observations. Latent states (as opposed to observable states) are states that are not directly observed but are rather inferred (through a mathematical model) from other states that are observed. In the weather system, the latent state is “sunny day” or “rainy day”. The observations are whether or not we see somebody carrying an umbrella. Therefore, given a set of observations, we can infer the latent states.

State transition kernel: The system evolves through a state transition kernel (or state transition matrix). Using this kernel, we can find the probability of the next day’s latent state given today’s latent state. For example, if we know the probability that tomorrow is a rainy day given that today is a sunny day, then we also know the probability that tomorrow is a sunny day given that today is a sunny day, since the two probabilities must sum to one.

$\begin{array}{c|cc} & S_{t+1} & R_{t+1} \\ \hline S_t & a_{11} & a_{12} \\ R_t & a_{21} & a_{22} \end{array}$

Here, we have a state transition matrix. What we know is $a_{11} + a_{12} = 1$ and $a_{21} + a_{22} = 1$.
Observation model: This is the probability, conditioned on whether today is a sunny day or a rainy day, that we see somebody carrying an umbrella or that we do not see anybody carrying an umbrella.

$\begin{array}{c|cc} & \mbox{No Umbrella} & \mbox{Umbrella} \\ \hline S & b_{11} & b_{12} \\ R & b_{21} & b_{22} \end{array}$

Similarly, we have $b_{11} + b_{12} = 1$ and $b_{21} + b_{22} = 1$.
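To make this concrete, here is a minimal Python sketch of the weather model above. The numerical values in A and B are made-up illustrations, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["S", "R"]                    # sunny, rainy (latent states)
obs = ["no umbrella", "umbrella"]      # observations

# State transition matrix: rows are today's state, columns are tomorrow's.
# Each row sums to 1 (a11 + a12 = 1, a21 + a22 = 1). Values are illustrative.
A = np.array([[0.8, 0.2],   # sunny -> sunny, sunny -> rainy
              [0.4, 0.6]])  # rainy -> sunny, rainy -> rainy

# Observation model: rows are latent states, columns are observations.
B = np.array([[0.9, 0.1],   # sunny -> no umbrella, umbrella
              [0.3, 0.7]])  # rainy -> no umbrella, umbrella

def simulate(T, s0=0):
    """Sample a path of latent states and observations of length T."""
    path, ys = [], []
    s = s0
    for _ in range(T):
        path.append(states[s])
        y = rng.choice(2, p=B[s])      # observe according to the current latent state
        ys.append(obs[y])
        s = rng.choice(2, p=A[s])      # move to the next latent state
    return path, ys

print(simulate(5))
```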

Although this is a very simple model, there are actually many versions of this. Let’s discuss some of its variants:

  1. Stock Market: In the stock market, we have a bear market and a bull market. We can have a model which says that if today is a bear market, then there is a probability X that the next day is also going to be a bear market.
  2. Coin Toss: We can mimic this scenario with a coin toss model: we toss a coin repeatedly, and when it comes up heads we gain one dollar, and when it comes up tails we lose one dollar. Here, gaining or losing are latent states and the results of the tosses are observations.
  3. Hidden Markov model in speech recognition: In this model, we have different vowels and consonants as latent states, and as observations we have the phonemes that people actually utter. The question is, based on the observation of the phonemes, “What are the vowels behind those phonemes?”, which we answer by exploring both our observations and the relationships among the vowels.
  4. DNA sequence: This model is an application of matching a DNA sequence with proteins and with RNA. We know that A, T, C, G are the building blocks of a DNA sequence. We match them together and predict the components of proteins. This is another example of a hidden Markov model.

 

2. Predator-Prey Dynamics

This system has predator and prey individuals and it evolves through the following three events.

  1. X -> 2X: A prey individual reproduces, generating two prey individuals. The rate for each prey individual to reproduce is α. Now, if there are |X| prey individuals in the system, then the rate of this event is α⋅|X|.
  2. X + Y -> 2Y: A prey individual and a predator individual interact and as a result, we have two predator individuals in the system. The rate (or the probability per unit time) for this event to happen is β. If there are |X| number of preys and |Y| number of predators, the total rate, i.e. the total number of events in a unit time, is going to be β⋅|X|⋅|Y|.
  3. Y -> 0: A predator individual disappears from the system. The rate of this event is γ⋅|Y|.

These events make this system a stochastic process. We know that given any sample path of a stochastic process, we should have a probability assigned to this sample path, and we need to work with this probability. Now, if we have some observations about this system, we should be able to find the maximum-likelihood estimate of the sample path that best fits the observations. The question that we want to solve here is: given observations, how can we find the sample path?

Let’s start by sampling the first event. At time 0 we have X0 preys and Y0 predators. Here, we sample this prey multiplication event for each of the |X| prey individuals. Over a time interval Δt, the probability for each prey individual to reproduce is α ⋅ Δt. There are |X| of them, so the total probability is going to be α ⋅ Δt ⋅ |X|.

Now we will sample the second equation. This means that for each combination of one prey individual and one predator individual, we sample this event. The rate of success is β, so for each prey-predator combination we have a β ⋅ Δt probability for this event to happen. Since there are |X| preys and |Y| predators, the total probability is going to be β ⋅ Δt ⋅ |X| ⋅ |Y|.

Finally, we sample the last equation, Y → ∅, and similarly we know that the probability is going to be γ ⋅ Δt ⋅ |Y|.

If Δt is small enough, then the probability of getting more than one event within Δt is very small, so we can approximately assume that at most one event happens. Although we have made three different samples (it could have been more if we had more than 3 events), we can approximately assume that only one event happens. Then, according to that event, we update the state from X0 to X1 and from Y0 to Y1.

Ultimately, if we sample event one in Δt, then X1 is going to be X0 + 1. If we sample event two, then X1 is going to be X0 - 1 and Y1 is going to be Y0 + 1. Similarly, if event three happens, then X1 = X0 and Y1 = Y0 - 1.
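Below is a minimal sketch of the approximate Δt-step simulation just described. The rates and initial populations are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta, gamma = 1.0, 0.005, 0.5   # illustrative rates, not values from the notes
dt = 1e-3                              # small time step

def step(x, y):
    """One approximate time step: we assume at most one event happens in dt."""
    p1 = alpha * dt * x                # prey reproduction  X -> 2X
    p2 = beta * dt * x * y             # predation          X + Y -> 2Y
    p3 = gamma * dt * y                # predator death     Y -> 0
    u = rng.random()
    if u < p1:
        return x + 1, y
    elif u < p1 + p2:
        return x - 1, y + 1
    elif u < p1 + p2 + p3:
        return x, y - 1
    return x, y                        # no event in this interval

x, y = 100, 20                         # initial prey and predator populations
for _ in range(10000):
    x, y = step(x, y)
print(x, y)
```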

If we assign probabilities (as discussed above) corresponding to each sequence of predators and preys, we will have a probability assigned to this system. Although this is a pretty complicated system, the trick is that if we introduce an event and introduce the probability of that event into this system, then we will be able to find out the probability of any sample path.

The way to sample this process without approximation is the following. First, we sample the time for the prey multiplication event to happen. Then we sample the time for the predation event to happen, then the time for the third event, and take the minimum. After we get the minimum, we update the state of the system (the populations) and advance the clock to that time. From that time we again sample the times to the next three events, take the minimum, and update the system state again.
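Here is a minimal sketch of this exact scheme (essentially a first-reaction style sampler): sample an exponential waiting time for each event, fire the earliest one, advance the clock, and repeat. The rates and populations are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta, gamma = 1.0, 0.005, 0.5   # illustrative rates

def simulate(x, y, t_end):
    """Exact sampling: fire whichever event has the smallest exponential waiting time."""
    t = 0.0
    while t < t_end and x > 0 and y > 0:   # stop if a population dies out (for simplicity)
        rates = [alpha * x, beta * x * y, gamma * y]
        waits = [rng.exponential(1.0 / r) if r > 0 else np.inf for r in rates]
        k = int(np.argmin(waits))          # the earliest event fires
        t += waits[k]
        if k == 0:
            x += 1                         # X -> 2X
        elif k == 1:
            x, y = x - 1, y + 1            # X + Y -> 2Y
        else:
            y -= 1                         # Y -> 0
    return x, y

print(simulate(100, 20, 5.0))
```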

But this way of sampling is quite inefficient in the sense that at each time we sample three exponential distributions. One solution to this problem is to use events to define a probability and to assign a probability measure to the system.

 

3. SIS Epidemic Dynamics

In this system we have susceptible individuals and infectious individuals, with the following three events:

  1. S+I→2I, rate = β|S||I|: In this event, whenever a susceptible individual and an infectious individual meet in a dynamic social network through close contact, there is a probability β that the infectious individual will turn the susceptible individual infectious.
  2. I→S, rate = γ|I|: This event pertains to the recovery of an infected individual.
  3. S→I, rate = α|S|: In this event a susceptible individual gets infected from somewhere outside of the system. This covers the situation where the system is not closed or isolated.

Using these three parameters, we can write down the probability for any sample path.

$ \begin{eqnarray} && P\left(\{X_{n,t},Y_{n,t}:n,t\},\alpha,\beta,\gamma\right)\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}P(X_{n,t+1}|\{X_{n',t}:n'\},\alpha,\beta,\gamma)\prod_{n,t}P(Y_{n,t}|X_{n,t})\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}\gamma^{1_{X_{n,t}=1\wedge X_{n,t+1}=0}}\cdot (1-\gamma)^{1_{X_{n,t}=1\wedge X_{n,t+1}=1}}\cdot \\ &&\left[\alpha+\beta\cdot\sum_{(n',n,t)\in E}X_{n',t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=1}}\cdot \left[1-\alpha-\beta\cdot\sum_{(n',n,t)\in E}X_{n',t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=0}}\cdot\\ &&\prod_{n,t}P(Y_{n,t}|X_{n,t})\\ \end{eqnarray} $
Here we have individuals indexed by n and time indexed by t. $X_{n,t}$ is the latent state and is either 0 or 1: if it is 0, person n is susceptible at time t, and if it is 1, person n is infectious at time t. $Y_{n,t}$ represents the observed symptoms of individual n at time t. If we think about it in a Bayesian way, we also have prior probability distributions for the rates α, β, and γ.
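As a sketch of the transition factor in the likelihood above, the following hypothetical function returns the one-step probability of a node's next latent state given its current state and the number of infectious neighbours; it ignores the priors on α, β, γ and the observation terms, and the parameter values are placeholders.

```python
def transition_prob(x_now, x_next, infected_neighbors, alpha, beta, gamma):
    """P(X_{n,t+1} = x_next | X_{n,t} = x_now, neighbours), following the factorization above."""
    if x_now == 1:                                   # currently infectious
        return gamma if x_next == 0 else 1.0 - gamma # recovery I -> S with probability gamma
    p_infect = alpha + beta * infected_neighbors     # outside infection + contact infection
    return p_infect if x_next == 1 else 1.0 - p_infect

# Example: susceptible node with 3 infectious neighbours becoming infected.
print(transition_prob(0, 1, infected_neighbors=3, alpha=0.01, beta=0.05, gamma=0.2))
```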

This is just one way of modeling the dynamics of diffusion. With it, we will be able to search in the network and find, for example, the spreading path of a rumor that best agrees with our observations, such as data we have scraped from the internet.

 

4. Exponential random graph model (ERGM)

In this system we will talk about the dynamics of networks. We know that in a social network setting we form friendships, and if A and B do not get in contact for quite some time, then the relationship between A and B will break. Another pattern is that if A and B are friends and B and C are friends, then it is quite likely for A and C to become friends. There are many ways for us to form a social network; yet another is that if all of us take the same class and the same course, then we are more likely to become friends. The question here is: how do we express this kind of complex network-formation dynamics as the evolution of a stochastic process and assign it a probability measure?

One way to think about this is to just look at the snapshots that we have, rather than looking at it as a dynamic process. For example, for a social network involving a specific set of faculty staff and students in the Science department, what is the probability for this snapshot of a network? We can think about this problem in terms of how the network evolves through a set of events. First, we look at the random variable as the state of this system. If we think about this network of n nodes as a directed network then we will have n ⋅ n number of edges. Each edge can take either 0 or 1 and at each time we could add an edge or remove an edge. And there’s a probability for both of these events to happen. Similarly, by introducing events and the probabilities for these events to happen, we can define a stochastic process.

Now we look at the dynamic version of the system. Here we have a network X indexed by time t. We sample the time to the next event, which is exponentially distributed as we have seen in the previous predator-prey example, and conditioned on that event happening, we randomly sample a directed pair (i, j); that is, we randomly pick one of the n ⋅ n edges. After that we decide according to the result of a coin toss: depending on how the coin lands, we either add this edge or remove this edge. Therefore, even for a dynamics as complicated as the formation of a network, we can still assign a probability measure.

Now we will discuss the equilibrium distribution of this kind of dynamics. If we let the network evolve for an infinite amount of time, what does the probability distribution look like in the end? This equilibrium distribution is normally an exponential-family distribution, and in the limit, from now to the infinite future, we lose all information about the current state.

Now, the probability distribution P is:
$ \begin{eqnarray} && \mathbf P(X_{i,j}=1|\underbrace{X\backslash X_{i,j}}_{\tiny X\mbox{ except }X_{i,j}}) = {P(X=\Delta_{i,j}^+x) \over P(X=\underbrace{\Delta_{i,j}^+x}_{\tiny x\mbox{ except }x_{i,j}=1})+P(\underbrace{X=\Delta_{i,j}^-x}_{\tiny x\mbox{ except }x_{i,j}=0})}\\ &\Rightarrow& \log{P(X_{i,j}=1|X\backslash X_{i,j})\over P(X_{i,j}=0|X\backslash X_{i,j})} = \sum_k \theta_k \left(z_k(\Delta_{i,j}^+x)-z_k(\Delta_{i,j}^-x)\right)\\ && \mbox{where }\delta^+_{i,j}z_k(x)=z_k(\Delta_{i,j}^+x)-z_k(\Delta_{i,j}^-x)\mbox{ is called change statistic.} \end{eqnarray}$
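A minimal sketch of how the change statistics can be used in practice, for example to resample a single edge from its conditional distribution (a Gibbs step). The two features used here, edge count and number of reciprocated pairs, are hypothetical choices of $z_k$, and the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def z(x):
    """Hypothetical features: number of edges and number of reciprocated pairs."""
    return np.array([x.sum(), (x * x.T).sum() / 2.0])

def gibbs_step(x, theta, i, j):
    """Resample edge (i, j) from its conditional distribution given the rest of the graph."""
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[i, j], x_minus[i, j] = 1, 0
    delta = z(x_plus) - z(x_minus)        # change statistics z_k(D+ x) - z_k(D- x)
    log_odds = theta @ delta              # log P(x_ij = 1 | rest) / P(x_ij = 0 | rest)
    p1 = 1.0 / (1.0 + np.exp(-log_odds))
    x[i, j] = rng.random() < p1
    return x

n = 5
x = (rng.random((n, n)) < 0.3).astype(int)  # random directed graph to start from
np.fill_diagonal(x, 0)
theta = np.array([-1.0, 0.5])               # illustrative parameters
x = gibbs_step(x, theta, 0, 1)
```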

We claim that the equilibrium distribution satisfies:

$P(x)\propto \exp\left(\sum_i \theta_i z_i(x)\right)$

$z_i(x)$ is a feature. If we add in the normalization constant we will get:

$P(x) = \exp\left(\sum_i \theta_i z_i(x) - \ln \sum_{x'} \exp\left(\sum_i \theta_i z_i(x')\right)\right)$

We can see that, if we can define a set of events, then we can assign probability measure to very complicated processes.

 

Stochastic Matrix

An (n × n) matrix is called a stochastic matrix if all entries are non-negative and the sum of each column vector is equal to 1.

For a stochastic matrix, every column is a stochastic vector.

If p is a stochastic vector and A is a stochastic matrix, then Ap is a stochastic vector.

Proof: Let $v_1, v_2, \dots, v_n$ be the column vectors of A. Then

$Ap = p_1 v_1 + p_2 v_2 + \dots + p_n v_n$

Each $v_i$ is a stochastic vector, so the entries of Ap are non-negative and sum to $p_1 + p_2 + \dots + p_n = 1$. Hence Ap is a stochastic vector.

A stochastic matrix has an eigenvalue 1. All other eigenvalues have absolute value smaller than or equal to 1.

Proof: For the transpose matrix $A^T$, the sum of each row vector is equal to 1, so $A^T$ has the all-ones vector $(1, 1, \dots, 1)^T$ as an eigenvector with eigenvalue 1.

Because A and $A^T$ have the same determinant, $A - \lambda I_n$ and $A^T - \lambda I_n$ also have the same determinant, so the eigenvalues of A and $A^T$ are the same. With $A^T$ having an eigenvalue 1, A also has an eigenvalue 1.

Assume now that v is an eigenvector with an eigenvalue $|\lambda| > 1$. Then $A^n v = \lambda^n v$ has exponentially growing length as $n \to \infty$. That implies that for large n some coefficient $[A^n]_{ij}$ is larger than 1. But $A^n$ is a stochastic matrix and has all entries $\leq 1$. Therefore the assumption of an eigenvalue with absolute value larger than 1 cannot be valid.
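A quick numerical check of both claims on a small column-stochastic matrix; the entries below are illustrative.

```python
import numpy as np

# Column-stochastic matrix: non-negative entries, each column sums to 1 (illustrative values).
A = np.array([[0.9, 0.3],
              [0.1, 0.7]])

p = np.array([0.25, 0.75])            # a stochastic vector
print(A @ p, (A @ p).sum())           # A p is non-negative and still sums to 1

eigvals = np.linalg.eigvals(A)
print(eigvals)                        # contains 1; all others have |lambda| <= 1
print(np.max(np.abs(eigvals)))        # 1.0
```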

 

Time Reversed Markov Chain

Consider a Markov chain: $\dots, X_{n-2}, X_{n-1}, X_n, \dots$

Tracing the Markov chain backwards: $\dots, X_n, X_{n-1}, X_{n-2}, \dots$

$\{X_{n-i}, i = 0, 1, \dots\}$ is also a Markov chain.

Let the limiting probabilities be $\pi_i$.

For Markov chains, given the present, the past and future are independent. Swapping the roles of “past” and “future”, it remains true that given the present, the future and past are independent. Hence, the reversed process is also a Markov chain.

Transition probabilities of reversed Markov Chain:

$Q_{i,j}={P(X_m=j | X_{m+1}=i)} = {P(X_m=j , X_{m+1}=i)\over P(X_{m+1}=i)} = {P(X_m=j ) P(X_{m+1}=i | X_m = j)\over P(X_{m+1}=i)} = {\pi_j P_{j,i} \over \pi_i}$

The reversed Markov chain is therefore itself a Markov chain, with transition probabilities $Q_{i,j} = \pi_j P_{j,i} / \pi_i$; in general these need not equal the transition probabilities of the original chain.

A Markov chain is time reversible if $Q_{i,j} = P_{i,j}$, that is, $\pi_j P_{j,i} = \pi_i P_{i,j}$.

P(seeing a transition from j to i) = P(seeing j) P(transit to i | j).

$\pi_j P_{j,i} = \pi_i P_{i,j}$ means the probability of seeing a transition from j to i is the same as the probability of seeing a transition from i to j. A transition from j to i for the original Markov chain is a transition from i to j for the reversed Markov chain.
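A small numerical check of this: compute the stationary distribution of a chain and test whether $\pi_i P_{i,j} = \pi_j P_{j,i}$ holds for all pairs. The row-stochastic matrix below is an illustrative example that happens to be reversible.

```python
import numpy as np

# Row-stochastic transition matrix (each row sums to 1); values are illustrative.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Detailed balance check: pi_i P_ij == pi_j P_ji for all i, j.
flux = pi[:, None] * P
print(np.allclose(flux, flux.T))      # True -> the chain is time reversible
```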

 

Stationarity and Equilibrium

A stationary distribution (also called an equilibrium distribution) of a Markov chain is a probability distribution $\pi$ satisfying $\pi = \pi P$, where P is the one-step transition probability matrix of the Markov chain. The state probability distribution does not change after a transition.

If a chain reaches a stationary distribution, then it maintains the distribution for all future time.

A stationary distribution represents a steady (equilibrium) state in the chain’s behavior.

A Markov chain is stationary from time t if $P(X_t = i) = P(X_{t+1} = i)$ for all i, that is,

$(P(X_t = i))_i = (P(X_t = i))_i \cdot (p_{i,j})_{i,j}$

A time reversible Markov chain (run from its reversible distribution) is stationary:

$P(X_{t+1} = i) = \sum_j P(X_t = j)\, p_{j,i} = \sum_j P(X_t = i)\, p_{i,j} = P(X_t = i)$

where the middle equality uses the reversibility condition $\pi_j p_{j,i} = \pi_i p_{i,j}$.

The converse of this is not true.

If a Markov chain has a limiting distribution $\pi_t \to \pi$, the limiting distribution is called the equilibrium distribution. An equilibrium distribution is a stationary distribution, whereas the converse is not true when the stationary distribution is not unique.
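A minimal sketch that computes a stationary distribution and checks that it is also the limiting distribution by taking powers of P; the matrix is illustrative.

```python
import numpy as np

# Row-stochastic transition matrix; values are illustrative.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Stationary distribution: pi = pi P, i.e. a left eigenvector for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print(pi)

# Limiting distribution: every row of P^t converges to pi as t grows.
print(np.linalg.matrix_power(P, 50))
```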

 

Continuous Time Markov Chain

A discrete state, discrete time Markov chain stays in a particular state for exactly one unit of time. We now relax this property and allow the chain to spend a continuous amount of time in any state, but in such a way as to retain the Markov property.

Assume throughout that our state space is S = Z = {· · · , −2, −1, 0, 1, 2, · · · } (or some subset thereof). Suppose now that whenever the chain enters a state $i \in S$, independent of the past, the length of time spent in state i is a continuous, strictly positive (and proper) random variable $H_i$ called the holding time in state i. When the holding time ends, the process then makes a transition into state j according to the transition probability $P_{ij}$, independent of the past. In other words, if we take ∆t to be infinitesimally small we get a continuous-time Markov chain. Letting X(t) denote the state at time t, we end up with a continuous-time stochastic process {X(t) : t ≥ 0} with state space S.

To make sure that a continuous-time process satisfies the Markov property (the future, {X(s + t) : t ≥ 0}, given the present state X(s), is independent of the past, {X(u) : 0 ≤ u < s}), we have to put conditions on the holding times.

For the holding times, a little thought reveals that they must have the memoryless property and thus should be exponentially distributed (geometrically distributed in the discrete-time case). To see this, suppose that X(t) = i. Time t lies somewhere in the middle of the holding time $H_i$ for state i. The future after time t tells us, in particular, the remaining holding time in state i, whereas the past before time t tells us, in particular, the age of the holding time (how long the process has been in state i). In order for the future to be independent of the past given X(t) = i, we deduce that the remaining holding time must only depend (in distribution) on i and be independent of its age; the memoryless property follows. Since an exponential distribution is completely determined by its rate, we conclude that for each $i \in S$ there exists a constant (rate) $a_i > 0$ such that the chain, when entering state i, remains there, independent of the past, for an amount of time $H_i \sim \exp(a_i)$. Thus a continuous time Markov chain can simply be described by a transition matrix $P = (P_{ij})$, describing how the chain changes state step-by-step at transition epochs, together with a set of rates $\{a_i : i \in S\}$, the holding time rates. Each time state i is visited, the chain spends, on average, $E(H_i) = 1/a_i$ units of time there before moving on.
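A minimal sketch of simulating such a chain: draw an exponential holding time with rate $a_i$, then jump according to the embedded transition matrix P. Both P and the rates below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Embedded jump chain (P_ii = 0) and holding-time rates a_i; values are illustrative.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
a = np.array([1.0, 2.0, 0.5])

def simulate(i, t_end):
    """Return the list of (jump time, state) pairs up to time t_end, starting in state i."""
    t, path = 0.0, [(0.0, i)]
    while True:
        hold = rng.exponential(1.0 / a[i])   # H_i ~ exp(a_i), mean 1/a_i
        if t + hold > t_end:
            break
        t += hold
        i = rng.choice(len(a), p=P[i])       # jump according to row i of P
        path.append((t, i))
    return path

print(simulate(0, 10.0))
```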


Infinitesimal Generator (Transition Rate Matrix)

Assume here that $P_{i,i} = 0$ for all $i \in S$. $a_i$ can be interpreted as the transition rate out of state i given that X(t) = i; the intuitive idea being that the exponential holding time will end, independent of the past, in the next dt units of time with probability $a_i\,dt$. More generally, for $j \neq i$, we additionally use the Markov property asserting that once leaving state i the chain will, independent of the past, next go to state j with probability $P_{i,j}$ to obtain:

$P'_{i,j}(0) = a_i P_{i,j}$

When i = j,

$P'_{i,i}(0) = -a_i$

So the matrix $Q = P'(0)$ given by the two equations above is called the transition rate matrix, or infinitesimal generator, of the Markov chain. It is essentially the derivative of the state transition matrix: we take the state transition matrix after Δt time, let Δt tend to 0, and if the limit exists we call this limit the infinitesimal generator, which describes how the system evolves over an infinitesimal amount of time.
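A short sketch of building Q from the holding-time rates and the embedded transition matrix according to the two equations above, using the same illustrative values as before.

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])      # embedded jump chain, P_ii = 0 (illustrative)
a = np.array([1.0, 2.0, 0.5])        # holding-time rates (illustrative)

# Q_ij = a_i * P_ij for i != j, and Q_ii = -a_i.
Q = a[:, None] * P
np.fill_diagonal(Q, -a)
print(Q)
print(Q.sum(axis=1))                 # each row of a generator sums to 0
```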

 

Chapman-Kolmogorov equations

For all t ≥ 0, s ≥ 0,

$P(t + s) = P(s)P(t)$

that is, for all t ≥ 0, s ≥ 0, $i \in S$, $j \in S$,

$P_{i,j}(t+s) = \sum_{k\in S}P_{i,k}(s)P_{k,j}(t)$

As for discrete-time chains, the (easy) proof involves first conditioning on what state k the chain is in at time s given that X(0) = i, yielding $P_{i,k}(s)$, and then using the Markov property to conclude that the probability that the chain, now in state k, would then be in state j after an additional t time units is, independent of the past, $P_{k,j}(t)$.

For the continuous case we can write down the derivative of P(t) with respect to t, which is going to be Q·P(t):

${d\over dt}P(t)=Q\cdot P(t)$

The unique solution of the above differential equation is given by:

$P(t) = e^{Qt}$

which can be expanded as:

$P(t)=\exp(Qt)=I+Qt+{1\over 2!}(Qt)^2+{1\over 3!}(Qt)^3+\dots$
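A quick check that $P(t) = e^{Qt}$ is a stochastic matrix, using the illustrative generator from above and scipy's matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-1.0, 1.0, 0.0],
              [1.0, -2.0, 1.0],
              [0.0, 0.5, -0.5]])     # the illustrative generator constructed earlier

P_t = expm(Q * 0.7)                  # P(t) = e^{Qt} at t = 0.7
print(P_t)
print(P_t.sum(axis=1))               # each row sums to 1, as a transition matrix should
```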


Kolmogorov’s forward and backward equations

From the Chapman-Kolmogorov equation we can derive the forward equation and the backward equation, as well as their discrete-time counterparts.

Kolmogorov’s backward equation:

${d\over dt}P(t)=Q\cdot P(t)$

The word backward refers to the fact that in our use of the Chapman-Kolmogorov equations, we chose to place the h on the right-hand side in back, $P(t+h) = P(h) P(t)$, as opposed to in front, $P(t+h) = P(t) P(h)$.

Kolmogorov’s forward equation:

${d\over dt}P(t)=P(t)\cdot Q$

Markov chain equivalent:

Kolmogorov’s backward equation:

$P_{t+1} - P_t = (P-I) \cdot P_t$

Kolmogorov’s forward equation:

$P_{t+1} - P_t = P_t \cdot (P-I)$

 

Uniformization

In a CTMC the transition (holding) times are exponential, but their means could vary from state to state. However, if it happened that these means were all the same, then we could represent the CTMC directly as a DTMC with transitions governed by an independent Poisson process, because in a Poisson process the times between transitions are IID exponential random variables. This situation may appear to be very special, but actually any finite-state CTMC can be represented in this way. We can achieve this representation by using the technique of uniformization, which means making the rates uniform or constant.

When the CTMC is in state i, each of these potential transitions is a real transition if it moves to another state, while it is a fictitious transition if it goes from state i back to state i, meaning that we remain in state i at that time, independently of past events. This uniformization construction requires that we change the transition matrix of the embedded DTMC. The new one-step transition matrix allows transitions from a state to itself and is constructed using the following equations:

$ p_{ij} = \begin{cases} q_{ij}/\gamma &\text{ if } i \neq j \\ 1 - \sum_{k \neq i} q_{ik}/\gamma &\text{ if } i=j \end{cases} $

where $ \gamma \geq \max_i |q_{ii}| $.

So the state distribution under the above transition matrix is given by:

$\pi(t) = \sum_{n=0}^\infty \pi(0) P^n \frac{(\gamma t)^n}{n!}e^{-\gamma t}$

where $\pi(t)$ is the state distribution at time $t$.

So we can sample the state at time t by first drawing the number of potential transitions n from a Poisson($\gamma t$) distribution and then taking n steps of the uniformized chain P.
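A minimal sketch of this uniformization-based sampling: draw the number of potential transitions in [0, t] from a Poisson(γt) distribution and take that many steps of the uniformized chain P. The generator below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

Q = np.array([[-1.0, 1.0, 0.0],
              [1.0, -2.0, 1.0],
              [0.0, 0.5, -0.5]])         # illustrative generator

gamma = np.max(np.abs(np.diag(Q)))       # uniformization rate, gamma >= max_i |q_ii|
P = np.eye(len(Q)) + Q / gamma           # uniformized DTMC; allows self-transitions

def sample_state(i, t):
    """Sample X(t) given X(0) = i by taking a Poisson number of uniformized steps."""
    n = rng.poisson(gamma * t)           # number of potential transitions in [0, t]
    for _ in range(n):
        i = rng.choice(len(P), p=P[i])
    return i

print(sample_state(0, 2.0))
```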

Comments

Avinav: Suppose initially the populations are $X_0$ and $Y_0$ and the following events occur: $E_1 \rightarrow E_3 \rightarrow E_2$. The probability of this sample path will be $P = (\alpha*\Delta t_1*|X_0|)*(\gamma*\Delta t_2*|Y_0|)*(\beta*\Delta t_3*|X_0+1|*|Y_0-1|)$.

wendong: If you work with prey individuals, you use $\alpha*(\Delta t)$ on each of the $|X_{t-1}|$ individuals; if you work with the prey population, you use $\alpha*(\Delta t)*|X_{t-1}|$ on the population. Let us move further comments on this topic offline.
