ARIMA [09/17/2015]

Moving Average Process(MA)

Formal Definition MA(1) Process:

X_t = \epsilon_t + \theta \cdot \epsilon_{t-1}
where $X_t$ is state of the system at time t,
$\epsilon_t$ is some effect/noise at time t,
$\theta$ is factor constant
$\epsilon_t-1$ is some effect/noise at time t-1.

We assume that $\epsilon_t$ is I.I.D $\mathbf{N}(0, \sigma^2)$.

Example 1:

Let $X_t$ represents “change in demand for lemonade at time t” $\Delta\mathrm{lemon}_t$ and $\epsilon_t$ represents change in temperature. If MA(1) for the $\Delta\mathrm{lemon}_t$ is given by following formula:
\Delta\mathrm{lemon}_t = \epsilon_t – 0.5 \cdot \epsilon_{t-1}

If temperature increase at time t, $\epsilon_t > 0$ so $\Delta\mathrm{lemon}_t > 0$ . It means that demand for lemonade increases at time t. At time (t+1) $\epsilon_{t+1} = 0$ , first term will be zero but there will be negative effect.

So there will be decrease in demand for lemonade relative to yesterday. This may be because they already have the lemonade from yesterday, the ‘residue’.

Example 2:

Let $X_t$ represents “change in oil price at time t” $\Delta\mathrm{OilPrice}_t$ and $\epsilon_t$ represents hurricane at sea. If MA(1) for the $\Delta\mathrm{OilPrice}_t$ is given by following formula:
\Delta\mathrm{OilPrice}_t = \epsilon_t + 0.5 \cdot \epsilon_{t-1}

If there is a hurricane at time t, $\epsilon_t > 0$ so $\Delta\mathrm{OilPrice}_t > 0$. It means that oil prices increases (may be due to supply) at time t.
At time (t+1) $\epsilon_{t+1} = 0$ (no hurricane), first term will be zero but there will be half effect from previous state.

\Delta\mathrm{OilPrice}_{t+1} = 0 + 0.5 \cdot \epsilon_{t}

So there will be increase in oil price. This may be because supply still not able to recover properly from the last state.

At time (t+1) $\epsilon_{t+2} = 0$ (no hurricane), $\Delta\mathrm{OilPrice}_{t+2} = 0$ .
So we can see that effect from time t propagate to two states t and (t+1) respectively.

Example 3:

Let $X_t$ represents “number of shoppers at time t” $\mathrm{Shoppers}_t$ and $\epsilon_t$ represents shoppers coming at shop at time t. If MA(1) for the $\mathrm{Shoppers}_t$ is given by following formula:

\mathrm{Shoppers}_t = \epsilon_t + 0.5 \cdot \epsilon_t-1

if at time t shoppers are coming to the shop, $\epsilon_t > 0$ so $\mathrm{Shoppers}_t > 0$. It means that total number of shoppers increases at time t.

Properties of MA(1) Process: Stationary and Weakly Dependent
1. Expected Value must be time independent.
\mathbb{E}(X_t) = \mathbb{E}(\epsilon_t) + \theta \cdot \mathbb{E}(\epsilon_{t-1})
\mathbb{E}(\epsilon_t) = 0 \rightarrow \mathbb{E}(X_t) = 0

2. Variance must be time independent.
\mathrm{Var}(X_t) = \mathrm{Var}(\epsilon_t) + \theta^2 \cdot \mathrm{Var}(\epsilon_{t-1})
\mathrm{Var}(X_t) = \sigma^2 + \theta^2 \cdot \sigma^2

3. Covariance also must not be function of time t.
\mathrm{Cov}(X_t, X_{t-1}) = \mathrm{Cov}(\epsilon_t + \theta \cdot \epsilon_{t-1}, \epsilon_{t-1} + \theta \cdot \epsilon_{t-2})
\mathrm{Cov}(X_t, X_{t-1}) = \theta \cdot \mathrm{Cov}(\epsilon_{t-1}, \epsilon_{t-1}) = \theta * \sigma^2

for h > 1
\mathrm{Cov}(X_t, X_{t-h}) = \mathrm{Cov}(\epsilon_t + \theta \cdot \epsilon_{t-1}, \epsilon_{t-h} + \theta \cdot \epsilon_{t-h-1})

4. To prove weakly dependence we have to show that
\lim{h\to\inf} \mathrm{Corr}(X_t, X_{t-h}) = 0

from (3), we can see that $h > 1$, value is zero. Hence we can say that this MA(1) process is stationary and weakly dependent.

Auto regressive (AR) model:

An AR model is a model in which the value of a variable in one period is related to its values in previous periods.
The previous values included are called lags and are important for time series analysis. For e.g.: The money in your account in one month is related to the money in your account in the previous month.
The characteristic equation for AR (p) model is as follows:
y_t = c + \sum_{t=1}^p \gamma_t \cdot y_{t-1} + \epsilon_t

where, p = No. of lags or the number of previous values of the variable taken
c = constant
$\gamma_t$ = coefficient of the lag variable at the previous time
$\epsilon_t$ = error term at time t, Error term is included to compensate for white
Stationarity: A random variable that is a time series is stationary if its statistical properties are all constant over time. A stationary series has no trend, its variations around its mean have a constant amplitude, and it wiggles in a consistent fashion, i.e., its short-term random time patterns always look the same in a statistical sense.

Let’s define 2 functions for the time series:

Autocorrelation function (ACF): It is the ratio of covariance of $y_t$ and $y_{t-k}$ to their variances
\rho_k = \mathrm{Corr} (y_t, y_{t-k}) = \frac{\mathrm{Cov}(y_t, y_{t-k}}{\sqrt{\mathrm{Var}(y_t)\cdot \mathrm{Var}(y_{t-k}}}


Random Walk

Lets suppose a university student, Aayush, is terribly drunk and is unable to walk properly. Assume he can take a step either left or right only, and he chooses it randomly. The question is, on average, where will he be after time t with respect to the place where he started? Will he be on average around the origin, or will he be further away as time increases?

Random Walk is a process that follows:
X_{t} = X_{t-1} + \epsilon_t
\end{equation} , where $\epsilon_t$ = iid(0,$\sigma^2$)
which means that the current state is the previous state with some added noise.


Another simple example is a robot in a 1 D, that can move only left or right only in integer steps. It decides it action in a random manner, by tossing a coin. If its head it moves +1, if its tail, it moves -1. This tossing of coin is represented by $\epsilon$ (taking values +1, -1 ensures that expected value is 0).
$X_t$ is the position of the robot at any time t. $\epsilon_t$ is the result of the coin toss at that time.
Inititally robot is at origin, therefore $X_0 = 0$

With this setup, we can answer questions like what will be the probability that robot is at position, $X_5$ = +3 at time t = 5.

People have used this model on market prices:

“…The random walk theory states that market and securities prices are random and not influenced by past events….”
“…The random walk theory proclaims that it is impossible to consistently outperform the market, particularly in the short-term, because it is impossible to predict stock prices…”

Formal Definition

More generally,
Random Walk is an AR-1 process with $\rho = 1$. AR-1 process:
X_{t} = \rho \cdot X_{t-1} + \epsilon_{t}
,$ \epsilon_{t} = iid (0, \sigma^2)$
With $\rho = 1$,
X_{t} = X_{t-1} + \epsilon_{t}

Non Stationarity

Setting $\rho = 1$ , makes the process non-stationary. Let us see why, there are three conditions for stationarity,

  • Mean should be constant, $\mathbb{E}(X_t) = constant$
  • Variance should be constant $\mathbb{V}(X_t) = constant$
  • Covariance should be constant $\mathbb{C}(X_t,X_{t+h}) = constant $

Expanding the above equation:
X_{t} = X_{t-2} + \epsilon_{t-1} + \epsilon_{t}
X_{t} = X_{0} + \sum_{i=0}^{t-1} \epsilon_{i}

\mathbb{E}(X_t) = \mathbb{E}(X_0) + \mathbb{E}(\sum_{i=0}^{t-1} \epsilon_{i})

We know, (because of linearity)
\mathbb{E}(\sum_{i=0}^{t-1} \epsilon_{i}) = \sum_{i=0}^{t-1} \mathbb{E}(\epsilon_{i})
And, (because of our assumption $\epsilon = iid(0, \sigma^2)$)
$\mathbb{E}(\epsilon_i) = 0$

Therefore, $\mathbb{E}(X_t) = \mathbb{E}(X_0) = $constant (w.r.t time)
The expectation does not voilate the stationarity condition.

Similarly we can see that variance of $X_t$ :
\mathbb{V}(X_t) = \mathbb{V}(\mathbb{X_0}) + \sum_{i=0}^{t-1}(\mathbb{V}(\epsilon_{i})

$\sum_{i=0}^{t-1}(\mathbb{V}(\epsilon_{i}) = t \cdot \sigma^2$

Hence here the stationary condition is violated, as we can see that the variance vary with time.


ARIMA [09/17/2015]

Discrete State Markov Processes [09/15/2015]

We will start off with discussions about several discrete state Markov processes and talk about some settings and their applications.

1. Simplified Weather System

In this system we have latent states and observations. Latent states (as opposed to observable states), are states that are not directly observed but are rather inferred (through a mathematical model) from other states that are observed. In the weather system, the latent state is “sunny day” or “rainy day”. The observations are whether we see somebody carrying an umbrella or we do not see anybody carrying an umbrella. Therefore, given a set of observations, we can infer the latent states.

State transition kernel: The system evolves through a state transition kernel (or state transition matrix). Using this kernel, we can find the probability of a latent state for the next day, given today’s latent state. For example, if we know the probability for the next day to be a rainy day, conditioned on that today is a sunny day, then we know the probability for the next day to be a sunny day, conditioned on the current day is a sunny day.

. St+1 Rt+1
St a11 a12
Rt a21 a22

Here, we have a state transition matrix. What we know is a11 + a12 = 1 and a21 + a22 = 1.
Observation model: This is the probability conditioned on that today is a sunny day or a rainy day, we see somebody carrying an umbrella or we do not see anybody carrying an umbrella.

. No Umbrella Umbrella
S b11 b12
R b21 b22

Similarly, we have b11 + b12 = 1 and b21 + b22 = 1.

Although this is a very simple model, there are actually many versions of this. Let’s discuss some of its variants:

  1. Stock Market: In the stock market, we have bear market and ox market. We can have a model which says that if today is a bear market, then there’s a probability X that the next day is going to be a bear market.
  2. Coin Toss: We can mimic the scenario with a coin toss model, such that, if we throw a coin repeatedly and then when there’s a heads up we gain one dollar and if we have a tail up I lose one dollar. Here, gaining or losing are latent states and the results of the toss are observations.
  3. Hidden Markov model in speech recognition: In this model, we have, different vowels and consonants as latent states. And as observations, we have phonemes which people normally use. The question is that based on the observation of the phonemes, “What are the vowels behind those phonemes?” by exploring both our observations and the relationships of the vowels.
  4. DNA sequence: This model is an application of matching DNA sequence with protein and with RNA. We know that ATCG are the building blocks of a DNA sequence. Now we match them together, and predict the components of proteins. This is one example of a hidden Markov model.


2. Predator-Prey Dynamics

This system has predator and prey individuals and it evolves around the following three events.

  1. X -> 2X: A prey we will reproduce and generate two prey individuals. The rate for each prey to reproduce is α. Now, if there are |X| number of preys in the system, then the rate of this event would be α⋅|X|.
  2. X + Y -> 2Y: A prey individual and a predator individual interact and as a result, we have two predator individuals in the system. The rate (or the probability per unit time) for this event to happen is β. If there are |X| number of preys and |Y| number of predators, the total rate, i.e. the total number of events in a unit time, is going to be β⋅|X|⋅|Y|.
  3. Y -> 0: A predator individual disappears from the system.

These sets of events make this system a stochastic process. We know that given any sample path of a stochastic process, we should have a probability assigned to this sample path and we need to work with this probability. Now, if we have some observations about this system we should be able to find out the maximum likelihood as dimensions about a sample path that best fits the observations. The question that we want to solve here is, given observations, how can we find out the sample path?

Let’s start by sampling the equation in the first event. At time 0 we have X0 number of predators and Y0 number of preys. Here, we sample this prey multiplication event for each of the X prey individuals. At Δt time, the probability for each of the prey individual is α ⋅ Δt. There are X number of them so the total probability is going to be α ⋅ Δt ⋅ |X|.

Now we will sample the second equation. This means that for each combination, we first take a prey individual and take a predator individual and sample this event. The rate of success is β. For each prey-predator combination we have β ⋅ Δd probability for this event to happen. Since there are X number of preys and Y number of predators, the total probability is going to be β ⋅ Δd ⋅ |X| ⋅ |Y|.

Finally, we sample the last equation. Sample Y to empty set and similarly we know that the probability is going to be γ ⋅ Δd ⋅ |Y|.

If Δt is small enough, then the probability for us to get more than one event at Δt is going to be very small. We can approximately think that there is only one event. Although we have made three different samples (could have been more if we had more than 3 events), we can approximately think that there will be only one event happens. Then, according to the event we update the state from X0 to X1 and from Y0 to Y1.

Ultimately, when we sample event one Δt, then X1 is going to be X0 plus one. If we have sampled event two, then X1 is going to be X0 – 1 and Y1 is going to be Y0 + 1. Similarly, if event three happens, then X1 = X0 and Y1 = Y0 – 1.

If we assign probabilities (as discussed above) corresponding to each sequence of predators and preys, we will have a probability assigned to this system. Although this is a pretty complicated system, the trick is that if we introduce an event and introduce the probability that event into this system, then we will be able to find out the probability of any sample path.

The secret way to sample this process without taking approximation is the following. First, we sample the time for the prey multiplication to happen. And then we sample the time for the predation event to happen and then we sample the time for the third event to happen and then take the minimum. After we get the minimum, we update the state of the system, the population, and then we alter the time. From that time we again sample the time to the next three events and then take the minimum and then update the system state again.

But this way of sampling is quite inefficient in the sense that at each time we sample three exponential distributions. One solution to this problem is to use events to define a probability and to assign a probability measure to the system.


3. SIS Epidemic Dynamics

In this system we have susceptible individuals and infectious individuals, with the following three events:

  1. S+I→2I, rate = β|S||I|: In this event, whenever a susceptible individual and a infectious individual meet in a dynamic social network through close contact, then there’s a probability β that the infectious individual will turn the susceptible individual infectious.
  2. I→S, rate = γ|I|: This event pertains the recovery of the infected individual.
  3. S→I, rate = α|S|: In this event a susceptible individual might get infected from somewhere outside of this system. This describes the condition that we do not cope with the complete or isolated systems.

Using these three parameters, we can write down the probability for any sample path.

$ \begin{eqnarray} && P\left({X_{n,t},Y_{n,t}:n,t},\alpha,\beta,\gamma\right)\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}P(X_{n,t+1}|\{X_{n’,t}:n’\},\alpha,\beta,\gamma)\prod_{n,t}P(Y_{n,t}|X_{n,t})\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}\gamma^{1_{X_{n,t}=1\wedge X_{n,t+1}=0}}\cdot (1-\gamma)^{1_{X_{n,t}=1\wedge X_{n,t+1}=1}}\cdot \\ &&\left[\alpha+\beta\cdot\sum_{(n’,n,t)\in E}X_{n’,t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=1}}\cdot \left[1-\alpha-\beta\cdot\sum_{(n’,n,t)\in E}X_{n’,t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=0}}\cdot\\ &&\prod_{n,t}P(Y_{n,t}|X_{n,t})\\ \end{eqnarray} $
Here we have individuals indexed by n and the time index by t. S represents the latent state and X(n, t) is either 0 or 1. If it’s 0 then this means that a person n is susceptible at time t, and if this is 1 this means that a person is infectious at time t. Y represents different symptoms for this individual n at time t. If we think about it in a Bayesian kind of way, then we have the probability distribution for the rate α, for rate β and for rate γ.

This is just one way of modeling the dynamics of the diffusion, this means that we’ll be able to search in the network and find out, for example, spreading of someone rumor that best agrees with our observation about, for example, what we have scraped from the internet.


4. Exponential random graph model (ERGM)

In this system we will talk about the dynamics of networks. We know that in a social network setting, we form friends and if A and B do not get in contact for quite some time, then the relationship between A and B will break. Another thing that we have is, if A and B are friends and if B and C are friends, then it is quite likely for A and C to be friends. There are many ways for us to form a social network. Another way to form a social network is that, if all of us take the same class and take the same course, then this means that we are more likely to become friends. The question here is how do we express this kind of complex dynamics about network formation as an evolution of stochastic processes and assign a measure?

One way to think about this is to just look at the snapshots that we have, rather than looking at it as a dynamic process. For example, for a social network involving a specific set of faculty staff and students in the Science department, what is the probability for this snapshot of a network? We can think about this problem in terms of how the network evolves through a set of events. First, we look at the random variable as the state of this system. If we think about this network of n nodes as a directed network then we will have n ⋅ n number of edges. Each edge can take either 0 or 1 and at each time we could add an edge or remove an edge. And there’s a probability for both of these events to happen. Similarly, by introducing events and the probabilities for these events to happen, we can define a stochastic process.

Now we look at the dynamics version of the system. Here we have a network X as indexed by time t. So sample the time to the next event to happen, which is an exponential distribution as we have seen in the previous predator-prey example and conditioned on that event to happen, we randomly sampled directed pairs which is i and j. That is, we just randomly sample an edge from the n ⋅ n edges. After that we decide according to the result of throwing a coin. Depending on to how we throw this coin, we can have a probability of either adding this edge or removing this edge. Therefore, as a result, even for a complicated dynamics such as the dynamics of the formation of a network we can still assign a probability measure.

Now we will discuss the equilibrium distribution of this kind of dynamics. If we let the network evolve for an infinite amount of time, then at the end, what is the probability distribution like? This probability distribution is normally exponential distribution in the sense that from now to the infinite future, we lose all information about what is the state of now.

Now, the probability distribution P is:
$ \begin{eqnarray} && \mathbf P(X_{i,j}=1|\underbrace{X\backslash X_{i,j}}_{\tiny X\mbox{ except }X_{i,j}}) = {P(X=\Delta_{i,j}^+x) \over P(X=\underbrace{\Delta_{i,j}^+x}_{\tiny x\mbox{ except }x_{i,j}=1})+P(\underbrace{X=\Delta_{i,j}^-x}_{\tiny x\mbox{ except }x_{i,j}=0})}\\ &\Rightarrow& \log{P(X_{i,j}=1|X\backslash X_{i,j})\over P(X_{i,j}=0|X\backslash X_{i,j})} = \sum_k \theta_k \left(z_k(\Delta_{i,j}^+x)-z_k(\Delta_{i,j}^-x)\right)\\ && \mbox{where }\delta^+_{i,j}z_k(x)=z_k(\Delta_{i,j}^+x)-z_k(\Delta_{i,j}^-x)\mbox{ is called change statistic.} \end{eqnarray}$

We claim that the equilibrium distribution satisfies:

$P(x)\propto \exp\left(\sum_i \theta_i z_i(x)\right)$

$z_i(x)$ is a feature. If we add in the normalization constant we will get:

$P(x)\propto \exp\left(\sum_i \theta_i z_i(x)\right) – \ln( \exp\left(\sum_i \theta_i z_i(x)\right))$

We can see that, if we can define a set of events, then we can assign probability measure to very complicated processes.


Stochastic Matrix

A (n X n) matrix is called a Stochastic matrix if all entries are non negative and the sum of each column vector is equal to 1.

For a stochastic matrix, every column is a stochastic vector.

If p is a stochastic vector and A is a stochastic matrix, then A*p is a stochastic vector.

Proof: Let v1, v2, ….., v n be the column vectors of A. Then,

Ap = p1 v1 + p2 v2 + ……… + pn vn

By summing, p1 + p2 + …. + pn = 1

A Stochastic matrix has an eigenvalue 1. All other eigenvalues are in absolute value smaller or equal to 1.

Proof: For the transpose matrix AT, the sum of row vectors is equal to 1. The matrix AT has the eigen vector
| 1 |
| 1 |
| 1 |
| 1 |

Because A and AT have the same determinant, also A – λIn and AT – λIn have the same determinant so that the eigenvalues of A and AT are the same. With AT having an eigenvalue 1, A also has an eigenvalue 1.

Assume now that v is an eigenvector with an eigenvalue | λ | > 1. Then An v = | λ |n v has exponentially growing length for n → . That implies that there is for large n one coefficient

[An]ij which is larger than 1. But An is a stochastic matrix and has all entries <= 1. The assumption of an eigenvalue larger than 1 can not be valid.


Time Reversed Markov Chain

Consider a Markov chain: ….,Xn-2 , Xn-1 , Xn , …..

Tracing the Markov chain backwards: ….,Xn , Xn-1, Xn-2 , …..

{Xn-i , i=0, 1 ,…} is also a Markov Chain.

Let the limiting probabilities be Πi .

For Markov chains, given the present, the past and future are independent.

“past” → “ future” , “future” → “past”

Given the present, the future and past are independent. Hence, reverse process is also a Markov chain.

Transition probabilities of reversed Markov Chain:

$Q_{i,j}={P(X_m=j | X_{m+1}=i)} = {P(X_m=j , X_{m+1}=i)\over P(X_{m+1}=i)} \\ ={P(X_m=j ) P(X_{m+1}=i | X_m = j)\over P(X_{m+1}=i)} \\ = Π_j . P_{i,j} / Π_j \\$

The reverse Markov chain has the same transition probability matrix as the original Markov chain.

A Markov chain is time reversible if Qij = Pij that is Πj Pji = Πi Pij .

P(seeing a transition from j to i) = P(seeing j) P(transit to i | j).

Πj Pji = Πi Pij means the probability of seeing a transition from j to i is the same as seeing a transition from i to j. A transition from j to i for the original Markov chain is a transition from i to j for the reversed Markov chain.


Stationarity and Equilibrium

A stationary distribution (also called an equilibrium distribution) of a Markov chain is a probability distribution Й = Й . P where P is a one-step probability matrix of a Markov chain. The state probability distributions do not change after a transition.

If a chain reaches a stationary distribution, then it maintains the distribution for all future time.

A stationary distribution represents a steady (equilibrium) state in the chain’s behavior.

Markov chain is stationary from time t is P(Xt = I) = P(Xt+1 = i) that is,

(P(Xt = i))i = ((P(Xt = i))i . (pi,j)i,j

Time Reversible Markov chain is stationary.

P(Xt+1 = i) = j P(Xt = j) pj,i = P(Xt = i)pi,j = P(Xt = I)

The converse of this is not true.

If a Markov chain has a limiting distribution Pt → P, the limiting distribution is called equilibrium distribution. Equilibrium distribution is stationary distribution whereas the converse is not true when stationary distribution is not unique.


Continuous State Markov Chain

As in the case of discrete state Markov Chain which stays in a particular state for exactly one unit of time, we relax this property to allow chain to spend continuous amount of time in any state but in such a way as to retain the properties of Markov Chain.

Assume throughout that our state space is S = Z = {· · · , −2, −1, 0, 1, 2, · · · } (or some subset thereof). Suppose now that whenever a chain enters state ἰ S, independent of the past, the length of time spent in state ἰ is a continuous, strictly positive (and proper) random variable Hi called the holding time in state ἰ. When the holding time ends, the process then makes a transition into state j according to transition probability Pij , independent of the past. So if we take ∆t to be infinitesimally small we will get a continuous state Markov Chain. Letting X(t) denote the state at time t, we end up with a continuous-time stochastic process {X(t) : t ≥ 0} with state space S.

To make sure that a Continuous Process satisfies the Markov property(The future, {X(s + t) : t ≥ 0}, given the present state, X(s), is independent of the past, {X(u) : 0 ≤ u < s}.), we have to put conditions on the holding times.

So for holding times, little thought process reveals that it must have memoryless property and thus should be exponentially or geometrically distributed. To see this, suppose that X(t) = ἰ. Time t lies somewhere in the middle of the holding time Hi for state ἰ. The future after time t tells us, in particular, the remaining holding time in state ἰ, whereas the past before time t, tells us, in particular, the age of the holding time (how long the process has been in state ἰ). In order for the future to be independent of the past given X(t) = ἰ, we deduce that the remaining holding time must only depend (in distribution) on ἰ and be independent of its age; the memoryless property follows. Since an exponential distribution is completely determined by its rate we conclude that for each ἰ S, there exists a constant (rate) ai > 0, such that the chain, when entering state ἰ, remains there, independent of the past, for an amount of time Hi exp(ai). Thus a Continuous Time Markov Chain can simply be described by a transition matrix P = (Pij ), describing how the chain changes state step-by-step at transition epochs, together with a set of rates {ai : ἰ S}, the holding time rates. Each time state ἰ is visited, the chain spends, on average, E(Hi) = 1/ai units of time there before moving on.

Infinitesimal Generator (Transition Rate Matrix)

Assume here that Pi,i = 0 for all ἰ S. ai can be interpreted as the transition rate out of state ἰ given that X(t) = ἰ; the intuitive idea being that the exponential holding time will end, independent of the past, in the next dt units of time with probability aidt. More generally, for j ≠ ἰ, we additionally use the Markov property asserting that once leaving state ἰ the chain will, independent of the past, next go to state j with probability Pi,j to obtain:

$P_{i,j}(0) = a_i P_{i,j}$

When i = j,

$P_{i,j} (0) = -a_i$

So the matrix Q = P`(0) given by 2 equations above is called the transition rate matrix or infinitesimal generator, of the Markov Chain. . This is pretty similar to the derivative of the state transition matrix. We have the state transition matrix after Δt time and if we let Δt tend to 0 and if the limit exists, then we call this limit as infinitesimal generator, which is how this system evolve over infinite amount of time.


Chapman-Kolmogorov equations

For all t ≥ 0, s ≥ 0,

$P(t + s) = P(s)P(t)$

that is, for all t ≥ 0, s ≥ 0, ἰ S, j S

$P_{i,j}(t+s) = \sum_{k\in S}P_{i,k}(s)P_{k,j}(t)$

As for discrete-time chains, the (easy) proof involves first conditioning on what state k the chain is in at time s given that X(0) = ἰ, yielding Pik(s), and then using the Markov property to conclude that the probability that the chain, now in state k, would then be in state j after an additional t time units is, independent of the past, Pkj (t).

For continuous case we can just write down the derivative of P(t) over t which is going to be Q P(t)

${d\over dt}P(t)=Q\cdot P(t)$

The unique solution for the above differential equation is given by:-

$P(t) = e^{Q,t}$

Which can be expanded to:-

$P(t)=\exp(Q)=I+Q+{1\over 2!}Q^2+{1\over 3!}Q^3+\dots$

Kolmogorov’s forward and backward equations

From the Chapman-Kolmogorov equation we can derive the forward equation and the backward equation and the discrete time counterpart of this forward equation, and the backward equation.

Kolmogorov’s backward equation:

${d\over dt}P(t)=Q\cdot P(t)$

The word backward refers to the fact that in our use of the Chapman-Kolmogorov equations, we chose to place the h on the right-hand side in back, $P(t+h) = P(h) P(t)$  as opposed to in front, $P(t+h) = P(t) P(h)$ 

Kolmogorov’s forward equation:

${d\over dt}P(t)=P(t)\cdot Q$

Markov chain equivalent:

Kolmogorov’s backward equation:

$P_{t+1} – P_t = (P-I) \cdot P(t)$

Kolmogorov’s forward equation:

$P_{t+1} – P_t =P(t) \cdot (P-I) $



In DTMC with exponential transition times, the means of those transition times could vary from state to state. However, if it happened that these means were all the same, then we could represent the CTMC directly as a DTMC with transitions governed by an independent Poisson process, because in a Poisson process the times between transitions are IID exponential random variables. This situation may appear to be very special, but actually any finite-state CTMC can be represented in this way. We can achieve this representation by using the technique of uniformization, which means making the rates uniform or constant.

When the CTMC is in state ἰ, each of these potential transitions is a real transition if it moves to another state, while the potential transition is a fictitious transition if a transition from state ἰ back to state ἰ, meaning that we remain in state ἰ at that time, independently of past events. This uniformization construction requires that we change the transition matrix of the embedded DTMC. The new one-step transition matrix allows transitions from a state to itself, which is constructed by using the following equations:-

$ p_{ij} = \begin{cases} q_{ij}/\gamma &\text{ if } i \neq j \\ 1 – \sum_{j \neq i} q_{ij}/\gamma &\text{ if } i=j \end{cases} $

Where $ \gamma \geq \max_i |q_{ii}| $

So the state distribution with the above transition matrix is given by equation:-

$\pi(t) = \sum_{n=0}^\infty \pi(0) P^n \frac{(\gamma t)^n}{n!}e^{-\gamma t}$

Where $\pi(t)$ is state distribution at time $t$.

So we can sample next state according to above $P^n$.

Discrete State Markov Processes [09/15/2015]

Concepts from Probability Theory [09/03/2015]

In this lecture we will review some concepts from probability theory. Since we are using probability theory, and in particular we’re using the concept of stochastic process to understand our complex social systems. I think it is better that we can reach some common ground about the concepts of probability theory. Today we’ll begin to involve some math so please bare with me about that. The most important thing is that we use math to do different things, to do important things. My understanding is that there are different people doing different things. There are mathematicians, and there are physicists, and there are engineers.

All people do math but they do math in different ways. If we think about how mathematicians do math, they treat math as a string of symbols. What they do is, they set up some actions and they set up some rules of derivation then they foil the rules then they get pages after pages of symbols and that’s it. That’s how mathematicians do math. They don’t care about the interpretation because once they interpret this then it’s metamath, it’s not longer math. You may ask why they want to do that, just play with the symbols just like artists play with their brush. The reason is that by doing this, they will no longer be constrained by our real world so they’re in a virtual world. Although it’s not very easy, they gain new freedom in their thought experiment in doing new things because in the real world, they are constrained by forces and by physical laws. But in a virtual world defined by the actions and by the laws, they no longer have such constraints so they gain new freedom although it is not easy.

There are also physicists. Physicists also work with math but physicists work with math in a different way. A physicist is always very interested about the intuition about the math laws and the theorems. Whenever they have something, they ask, “What is the intuition and can we visualize that?” What is the force and what is acceleration? The emphasis of physicists is they want to find the intuition, they want to use math to assist them in their understanding of physical laws. This is their way of working with math.

Engineers work still in a very different way. Engineers tend to solve problems with math. This is what we do. One thing that I can say is that whenever we work with math, it’s always that we just build things. I think the most important thing is that we understand how we build those conclusions with math and I will show you the way of building the conclusions. First, it’s nothing that is not understandable. I think that we’re more interested in using math to solve our problems.

Sample Space, Events and Field of Events

We have heard a lot about probability as random events. Let’s first have some common ground on what is sample space and event, what is a field of events, what is expectation and conditional expectation.

Let’s throw a dice twice. What is the sample space? A sample space is the space of all outcomes ($\Omega$). From the sample space, we might outcome $(1,1)$, meaning we get a one from the first throw and get another one from the first throw, and so on and so forth. The size of the sample space is $6\times 6=36$.

If we continue to throw a dice then we get more and more information about the sequence.

  • Before we throw the dice, the only events that we can talk about is the set of all possible outcomes or its complement the empty set ($(\emptyset, \Omega)$).
  • After we throw the dice once, we get some information — the information about the first throw: We could get 1, 2, 3, 4, 5 or 6, which could be an even or odd number, or a prime or composite number. Those are all examples of events about the first throw. How large will it be the set of events from the first throw? — It is $2^6$: The event could include 1 or exclude 1, and it could include 2 or exclude 2, and so on. Different inclusions and exclusions give different events. If we consider whether or not this event include 1 then we have two choices. Then if we consider whether the event include 2, we have another two choices. In total, we will have $2^6$ different choices. Those are all different events. The set of events after the first throw includes the set before the first throw. (Verify it!)
  • When we throw this dice a second time, we can talk about the outcome of not only the first throw but also the second throw. The set of events after the first throw is a subset of the events of the second throw, which is a subset of the events in the third throw, and so on. (Verify it!) This is how, as we conduct more and more experiments, we’ll gain more and more information about this system and we will have larger and larger sets of events.

The sample space is $\Omega$. An event is any subset of $\Omega$. An atomic event is a event that cannot be subdivided, a compound event is a union of atomic events, the complement of an event is  the case that the event doesn’t happen. For example, after the first throw, the atomic events are all the numbers we could read from the dice: $\{1,2,3,4,5,6\}$, a even number is a compound event and its complement is an odd number. The set of all events, which is a set of sets, form a field. We call something a field if the following conditions are satisfied.

  • The empty sets should belong to the field.
  • The sample space $\Omega$ should belong to the field.
  • If we have two events $A$ and $B$, the union of the two events (either A or B), the intersection of the events (both A and B), and the subtraction (A but not B) are also events. For example, one event could be that the outcome is an odd number. Another event could be that the outcome is a prime number, then we can talk about odd prime number, odd or prime number, and a number that is odd and is not prime.

Random Variables

Up to now, we have defined a sample space, atomic events, compound events, and field of events. We will proceed to define random variables or functions defined on a field of events.

In the experiment of throwing the dice twice, we can define a function from our sample space which  $\{1,\cdots,6\}\times \{1,\cdots,6\}$, it’s a tuple. We can define a function on this sample space, meaning that we can assign numerical values to different outcomes. An outcome could be $(1,1)$, for example and then we set a function that is the sum of the throws $1+1=2$. Or we could define a characteristic function which equals 1 if the number yielded by the second throw is 3 times the number yielded by the first throw, and 0 otherwise. There are many different ways to define a function on a sample space of finite number of outcomes.

A random variable defines a field. Let us inspect a random variable $X: \{1,2,3,4,5,6\}\to\{0,1\}$ which is the characteristic function for an outcome to be an even number $X(\omega)=(\omega+1) \mod 2$. So, for $\omega$ taking 1,2,3,4,5,6 an $X(\omega)$ takes 0,1,0,1,0,1. What is the field defined by $X$? According to our definition of a field, we have to add in empty set $\emptyset$. We have to add in the sample space $\{1,2,3,4,5,6\}$. We have to add in event $\{X=0\}=\{\omega: X(\omega)=0\}$. The outcomes 1,3 and 5 belong to this set. We have to add in event $\{X=1\}$, which contains the outcomes 2,4, and 6. We can verify that this is the field generated by X: The empty set and the world set $\Omega$ belong to the set of sets. For any event $A$ and $B$, $A \cup B$, $A\cap B$ and $A\\  B$ also belong to a set.

In addition, there are many ways to define a field on the sample space, a random variable may be on field set or maybe not on this field. For example, we have a field of events defined by $X(\omega)$ — that $\omega$ is a odd number or a even number. Is random variable $Y(\omega)=\omega$ defined on the field generated by $X$? It is not because the atomic events ($\{1,2,3,4,5,6\}$) are not events generated by $X$. The field generated by $Y$ includes and is larger than the field generated by $X$. If we look at the field subset relationship in the other direction, we find that the events generated by X, which is $\{X=0\}$, $\{X=1\}$, the whole set $\Omega$, and the empty set $\emptyset$, all belong to the events generated by Y. In this incidence we say that random variable $X$ is $Y$ measurable or is the random variable defined on Y.

For another example, we can define a random variable as the sum of two throws. This sum defines a field generated by events $\{X_1+X_2+=2\},\cdots,\{X_1+X_2+=12\}$. (Verify this!)

It is a little complicated to define a random variable as a function defined on a field rather than to directly talk about atomic events and assign a probability measure. It is actually necessary if we want to deal with complicated randomness in reality. For example  the world economics as our sample space is too complex to be modeled. A random variable X such as “whether the stock price of Apple will rise or fall tomorrow” could be much simpler and manageable. X is a map from the very complicated world into a much simpler one, and X does not capture all randomness about the world economics. To see that X doesn’t capture all randomness in the world of economics, we can define a random variable Y, which is “whether the funding in my retirement account is going to rise or fall tomorrow”, and which is likely independent of random variable $X$ and not defined on the field generated by $X$. In contrast if we confine ourselves to only work with the values of X as our sample space, we cannot describe something as complex as world economics. In addition since X captures all randomness that we know of, we cannot defined a random variable $Y$ from the sample space defined by the value of $X$ that is independent of $X$. That is the reason why we bother to first define omega which is sample space and then define a random variable X which maps this sample space into a simpler space. We can start from some very complex world and we chip something out to study which is simple and manageable.

Probability, Expectation & Conditional Expectation

A probability measure is positive numbers assigned to the atomic events in a sample space. The sum of the probabilities assigned to the atomic events should sum up to 1 because we do not want to leave out something.

Let’s assign equal probability the 36 possible outcomes from two throws of a dice. The probability for each outcome is 1/36. Now that we have assigned probability to all of the atomic events, we can talk about the probability assigned to compound events, which is the sum of the probabilities assigned to its elements. For example, what is the probability that the sum of the two throws is 2? It is 1/36 because the only possible atomic event to generate a 2 is $(1,1)$. Similarly the probability for the sum of the two throws to be 3 is 2/36, and the probability for the sum to be 4 is 3/36.

After we have assigned probability to the sample space, we can estimate the expectation and variance of a random variable. (A random variable is a function that assigns a value to an outcome of a experiment.) What is the probability mass function of $X$, that is the sum of two throws, supposing that every one of the 36 outcomes are equally likely? The outcome of the sum of the two throws is from 2 to 12 (where 12 corresponds to $(6, 6)$), with each outcome $x$ assigned a probability that is the number of occurances of $x$ in the following table divided by 36. The sum of the probabilities sum up to 1.

+ & 1 & 2 & 3 & 4 & 5 & 6\\
\hline 1 & 2 & 3 & 4 & 5 & 6 & 7\\
2 & 3 & 4 & 5 & 6 & 7 & 8\\
3 & 4 & 5 & 6 & 7 & 8 & 9\\
4 & 5 & 6 & 7 & 8 & 9 & 10\\
5 & 6 & 7 & 8 & 9 & 10 & 11\\
6 & 7 & 8 & 9 & 10 & 11 & 12
What is the expectation of X? The expectation of this random variable is the weighted sum over all possible outcomes of X, where the weight is the probability ($E(X)==\sum_\omega X(\omega)P(\omega)=\sum_x x\cdot p(x)$). If we sum these together, we get 7. What is the variance of X? The variance of X is the weighted square of the difference between X to the average value of X, where the weight is the probability($\mbox{Var}(X)=\sum_x (x-E(X))^2\cdot p(x)$). We can show that $\mbox{Var}(X)=E(X^2)-(E(X))^2$.

Similarly, if we define a random variable $g\circ X$ by function composition, where random variable X is already a function. For example, if X is identity function $X(\omega)=\omega$, $g\circ X(\omega)=2\times\omega$ for $g(X)=2\times X$, and$g\circ X(\omega)=\omega^2$ for $g(X)=X^2$. We can estimate the expectation of function composition $E(g\circ X) = \sum_\omega g(X(\omega))p(\omega)$, where $\omega$ takes atomic events in the sample space. There is a simpler way to estimate expectation: $g\circ X=\sum_x g(x) p(x)$, because the field generated by $X$ is a subfield of the field generated by all $\omega\in \Omega$.

Now we talk about conditional expectationconditional probability and the conditional independence. Recall that we assigned probability 1/36 to each of those 36 events. From this probability assignment it follows that the probability for the first throw and the probability for the second throw are independent. Here is the reason. We know that $P((m,n))={1\over 36}$, $P((m,\bullet))={1\over 6}$ and $P((\bullet,n))={1\over 6}$ by our setup. Hence $P((m,n))=P((m,\bullet))\times P((\bullet,n))$. This means that given the result of the first throw does not help us much to make inference about what will be the result of the second throw. That’s the first throw and the second throw as independent based on our assignment of 1/36 to the atomic events.

For another example, conditioned on that is the sum of the two throws is 3, the probability of the first throw to yield 1 is 1/2 and the probability of the first throw to yield 2 is also 1/2. The reason is that we only have two cases that we can yield a sum of 3, which is $(1,2)$ and $(2,1)$.

Below are several important properties of expectation and variance:

  • Expectation of linear combination Let $X_i$ be random variables, then $\bf E(a_0+a_1 X_1+\dots + a_n X_n) = a_0 + a_1\bf E(X_1) + \dots + a_n \bf E(X_n)$. (Why?)
  • Expectation of independent product Let $X$ and $Y$ be independent random variables, then $\bf E(X Y) = \bf E(x)\bf E(Y)$. (Why?)
  • Law of total expectation Let $X$ be integrable random variable (i.e., $\bf E(|X|)<\infty$) and $Y$ be random variable, then $\bf E(X) = \bf E_Y(\bf E_{X|Y}(X|Y))$. (Why?)
  • Variance of linear combination Let $X_i$ be random variables, then $\bf{Var}(a_0+a_1 X_1+\dots + a_n X_n) = a_0^2 + a_1^2\bf{Var}(X_1) + \dots + a_n^2 \bf{Var}(X_n)$
  • $\bf{Var}(X) = \bf E(X^2) – {(\bf E(x))}^2$

A stochastic process evolves and we get more observations about it over time. The future observations are conditioned on the current and the past observations. That is how this conditional probability and conditional expectation emerge and why it is important in studying a stochastic process. Let us work with conditional expectation in a finite sample space $\Omega$. The field of events is the power set $2^\Omega$ or the set of all subsets of $\Omega$. We are given a field $\mathcal G$ generated by a partition $(D_1,\cdots,D_k)$ ($\cup_i D_i=\Omega$, $D_i\cap D_j=\emptyset$ for $i\ne j$). We are given a random variable X which assigns the atomic events to the atoms to a number $x_1,\cdots, x_p$. The conditional probability of a event A given event D is the probability of A and D divided by the probability of D, or the probability for A and D to both happen conditioned on that probability D happened. For example the conditional probability for the sum of two throws to be even ($X_1+X_2 \mod 2=0$) conditioned on that the first throw is a prime number ($X_1$ is prime). The conditional probability of event A given a field $\mathcal G$ is the probability of an outcome $\omega$ to be in $A$ conditioned on that we know which $D_i$ this $\omega$ belongs to: $P(A|\mathcal G)(\omega)=\sum_i P(A|D_i)I_{D_i}(\omega), where I is an indicator function and $d_1$ to $d_k$ make a partition of the whole sample space. So conditional probability is a random variable. For example the conditional probability for the sum of two throws to be a even number conditioned on that we know whether the first throw is a prime number or a composite number — The conditioned random variable (or field) represents things we know and the conditioning variable represents things we do not know.  Similarly we can also define conditional expectation of a random variable given a field $E(X|\mathcal G)=\sum_i x_i P(X=p_i|\mathcal G)$, and a conditional expectation is also a random variable.

Continuous Probability Space

We talked about how people work on a space with finite number of values. Now we move from finite to infinite. The way that we do this is actually very simple: We approximate the infinite in terms of finite, and take the limit. That is, we aim to approximate the probability of a measurable set in a continuous space better and better when the number of “tiles” to cover this measurable set goes to infinite. Similarly, we approximate a measurable functions with a finite sum of indicator functions of measurable sets and aim to improve our approximation of the continuous function as we sum over more and more indicator functions. A random variable is just a measurable function that maps a probability space into a measurable space.

When we move from a sample space with finite outcomes to one with continuous outcomes, we need to work with a sigma field. A sigma field is a field with the capability to move us from finite to infinite.In a $\sigma$-field, disjoints measurable sets are countably additive, meaning that we can add countably number of sets.

Let us sample a standard normal distribution using $X=\sum_{i=1}^{12}(U_i-{1\over 2})$, where $U_i$ takes uniform distribution on $[0,1]$. We can show that $X$ has mean 0, standard deviation 1, and probability density function very close to standard normal distribution.

Our sample space here is 12 dimensional $\prod_{i=1}^{12}[0,1]$. $X$ maps the 12 random variables of uniform distributions to the real line. We can define probability density function and cumulative probability function for real-value random variable $X$.

Bayesian Theorem

Bayesian theorem enables us to turn the conditional around. We have worked with Bayesian theorem in the last lecture about predicting whether from the daily activity of my girlfriend. Here we have another one. Suppose we have a very rare disease and this disease is pretty serious so we want to screen this disease in advance. Let $p$ represents positive clinical test and $d$ represent having disease. Let the probability of positive clinical test conditioned on disease be $P(p|d)=.98$ and let the probability of positive clinical test conditioned on no disease be $P(p|\Omega\\d)=.02$, and let the prior probability of disease be $P(d)=.0001$. If I’m diagnosed to have some positive clinical test result, what is the probability that I have a disease? We will use Bayesian theorem to turn the conditional around, from $P(p|d)$ to $P(d|p)$. It turns out that if I have positive clinic test, the probability for me to have this disease is still pretty small, .1%, which is still not much to worry about. On the other hand, since the prior probability is .001, a positive clinic test increase the probability of having disease by 10-fold.

What is the relationship between mutual independence and pairwise independence? Mutual independence implies pairwise independence but not the other way. In the following example, we have 2 out of 4 faces in red, green and blue, hence $P(r)=P(g)=P(b)={1\over 2}$. We have 1 out of 4 faces in red and green, green and blue, and red and blue, hence $P(rg)=P(gb)=P(br)={1\over 4}$ so we have pairwise independence. We have 1 out of 4 faces in red, green and blue, hence $P(rgb)={1\over 4}$ But since $P(rgb)={1\over 4}\ne P(r)P(g)P(b)={1\over 8}$, we do not have mutual independence.


Concepts from Probability Theory [09/03/2015]

Introduction [09/01/2015]

What is Simulation Modeling and Why do We Need it?

In this course, we’re going to talk about simulation modeling. What is simulation modeling? We have a world. We do a lot of things to make this world a better world: We build roads, we build cities, and we fight with diseases, and so on. Those are pretty expensive human endeavors because those activities may cause many man-months associated with millions of dollars, and they may have substantial consequences.

How can we make such activities efficient and error-free as much as we can? We do simulation modeling: First we find out the basic patterns of this real world. Then we construct a virtual world that duplicates the real world following the same set of rules. Then we simulate this virtual world, and we conduct thought experiments and reason about how we can improve this virtual world. After we finally find a way to improve this virtual world out of many trials and errors, we can proceed to implement this way in our real world. Why do we think this strategy to improve the virtual world will also improve the real world? The reason is that we have set up the virtual world to run under the same set of rules as the world world. This 1-to-1 map between the virtual world and the real world is our guarantee that whatever runs in the virtual world will be followed by the real world and vice versa. This type of reasoning doesn’t always work — We’re human beings and we make mistakes. We reflect on our past, plan our future, and conduct thought experiments at all times.

Simulation modeling has been widely applied by researchers from different disciplines to assist their scientific reason. We computer scientists need to understand the applications of simulation modeling to design algorithms and help researchers from other disciplines in their thought experiments. In systems biology, it finds its application to model the stochasticity of the chemical reactions happening at the cell level. According to Wilkinson, “modeling is attempt to describe in a precise way an understanding of the elements of a system of interests, their states, and their interactions with other elements” and it “should be sufficiently detailed”. Joshua Epstein — social scientist working on computational epidemics other dynamical systems would conduct research on generative social science by “situating an initial population of autonomous heterogeneous agents in a relevant spacial environment, allow them to interact according to simple local rules, and thereby generate or grow the microscopic regularity from the bottom up”. His approach is derived from from the approach in solid matter and statistic physicists. The idea is that we can make macroscopic observations about our system, such as temperature, or pressure and volume. We want to understand what’s going on there and how we can have pressure, and temperature, and volume. Our approach is simulate modeling: We try to construct the microscopic measure of our system from the microscopic interactions of the atoms. If we can construct the macroscopic phenomena from microscopic interactions of the atoms, we have a hypothesis about the dynamics of this system.

Why do we need models? According to Wilson, we need models because we want to check out our understanding of a particular system, conduct thought experiments and use those thought experiments to inform our design and analysis. According to Epstein, with models “we can check with a hypothesized micro-specification suffice to generate the observed phenomenon”. According to Page in his online course Model Thinking, models can help us involve conversations, weed out the logic inconsistencies, engage and understand the data, and decide, strategize and design our ways to work with the world.

How does simulation modeling work and how computer scientists can make better simulation-modeling tools? Let us conduct some thought experiments.

Simplified Weather System

The first model is a simplified weather system. In this weather system, we have sunny days and rainy days. If today is a sunny day, tomorrow is more likely to be a sunny day because the weather tends to stay in a state for a while. Similarly, if today is a rainy day, tomorrow is more likely to be a rainy day. Sunny days tend to last longer, and the rainy days tend to last shorter. Let us for the sake of reasoning specify that if today is a sunny day ($X_t=\mbox{sunny}$), then with 90% probability that tomorrow is going to be a sunny day ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})=0.9$) and with 10% probability that tomorrow’s going to be a rainy day ($P(X_{t+1}=\mbox{rainy}|X_t=\mbox{sunny})=0.1$); If today’s a rainy day ($X_t=\mbox{rainy}$), then with 80% probability tomorrow is going to be a rainy day ($P(X_{t+1}=\mbox{rainy}|X_t=\mbox{rainy})=0.8$), and with 20% probability tomorrow is going to be a sunny day ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{rainy})=0.2$). We have from sunny to rainy, 10%, rainy to sunny 20%.

Given this simple model and given that today is a sunny/rainy day, we can simulate/sample this weather system: Given that today’s a sunny day, we can throw a biased coin that yields a head with probability 0.9 and yields a tail with probability 0.1, and we note a sunny day for tomorrow’s weather if we get a head and a rainy day if we get a tail. Given that today’s a rainy day, we can throw another biased coin that yields a head with probability 0.2 and yields a tail with probability 0.8, and we note a sunny day for tomorrow’s weather if we get a head and a rainy day if we get a tail. In this way we can get one sample path “sunny, sunny, sunny”, and another sample path “sunny, sunny, rainy”. Corresponding to each sample path, we have a probability: If today’s a sunny day, then the probability for tomorrow’s a sunny day is .9, and the probability for condition tomorrow’s a sunny day, the probability for the day after tomorrow is a sunny day is also probability of 9. A sequence “sunny, sunny, sunny” is associated with a probability $P(X_{t+1}=\mbox{sunny},X_{t+2}=\mbox{sunny}|X_{t}=\mbox{sunny})=.9\times .9=.81$. Similarly, a sequence “sunny, sunny, rainy” is associated with a probability which is $P(X_{t+1}=\mbox{sunny},X_{t+2}=\mbox{rainy}|X_{t}=\mbox{sunny})=.9\times .1=.09$.

So we have defined a a way to sample system, and we have associated some randomness to different sample paths of this system. In this sense we have defined a generative model which we can use to generate different sample paths with different probabilities.

Now suppose that I at Buffalo chat with my girlfriend at San Francisco everyday. While she doesn’t tell me the daily weathers in San Francisco, she tells me whether she focuses on cleaning her apartment or she hikes around every day. On a sunny day, she is more likely to hike around and on a rainy day he is more likely to clean her apartment. Let us make some inferences about San Francisco’s weather from my girlfriend’s daily activities.

We will use forward-backward algorithm, which comes from Bayesian theorem. Suppose that today is a sunny day in San Francisco, and tomorrow my girlfriend tells me that she’s cleaning her apartment all day. What is going to be the probability for tomorrow to be a sunny day, and what is the probability for tomorrow to be a rainy day? Let $X_t$ represent the latent weather in San Francisco (which I do not know at day $t$) and Let $Y_t$ represent the observation of my girlfriend’s activity at day $t$.

Here’s our reasoning: Conditioned on that today is sunny, the probability for tomorrow as a sunny day is 90% ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})=0.9$) and the probability for tomorrow to be a rainy day is 10%. Conditioned on that tomorrow is a sunny day, my girlfriend cleans her apartment with 10% probability ($P(Y_{t+1}=\mbox{cleaning spartment}|X_{t+1}=\mbox{sunny})=0.1$), and walks outside with 90% probability($P(Y_{t+1}=\mbox{walking}|X_{t+1}=\mbox{sunny})=0.9$). Conditioned on that tomorrow is a rainy day, my girlfriend cleans her apartment with 90% probability and walks outside with 10% probability. Since my girlfriend tells me that she cleans her apartment tomorrow, my girlfriend could clean her apartment on a sunny day, and she could also clean her apartment on a rainy day. The probability for my girlfriend to clean her apartment on a sunny day tomorrow conditioned on that today is a sunny day is $P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})\times P(Y_{t+1}=\mbox{cleaning spartment}|X_{t+1}=\mbox{sunny}) =.9 \times .1 = .09$, and the probability for my girlfriend to clean her apartment on a rainy day conditioned on that today is a sunny day is $P(X_{t+1}=\mbox{rainy}|X_t=\mbox{sunny})\times P(Y_{t+1}=\mbox{cleaning spartment}|X_{t+1}=\mbox{rainy}) = .1\times .9 = .09$. The total probability for my girlfriend to clean her apartment is $.09+.09=.18$ , and conditioned on that tomorrow my girlfriend cleans her apartment, the probability for tomorrow to be a sunny day is ${.09\over .09+.09}=50\%$ , and the probability for tomorrow to be a rainy day is also $50\%$. This is the kind of reasoning that we engage to work with a stochastic processes.

Now suppose my girlfriend and I have chatted about her daily activities for many, many days. We have a sequence of activities such as “walking, walking and cleaning”. Instead of asking the probability for a specific day to be a sunny day or for a specific day to be a rainy day given daily activities, we ask, what is the most likely sequence of weathers corresponding to the sequence of activities.

The challenge to solve this inquiry is that the size of our search space increases exponentially with the number of days in the sequence: Conditioned on today’s weather, tomorrow’s weather is one of two states, and the weathers of tomorrow and the day after tomorrow will take 1 of  the $2 \times 2=4$ states, and the weathers of the next three days will take one of the $2 \times 2 \times 2=8$ states, and so on. In order to efficiently find out the most likely path, we will use a technique known as dynamic programming. If we know the most likely weather sequence from day 1 to day $t$ ending at state $X_t$ (sunny or rainy) at day $t$, the most likely sequence from day 1 to day $t+1$ ending at state $X_{t+1}$ is the one with the largest probability that passes through $X_t$ and $X_{t+1}$. In this way, we can efficiently compress the space for our search, and thus makes a search in an exponential state space into one with polynomial state space. That is the idea of the Viterbi algorithm.

Now suppose we have estimated the probability distributions of daily weathers at San Francisco from my girlfriend’s daily activities, or we have estimated the most likely sequence. Can we estimate the parameters of our stochastic process model? — What is the likelihood for tomorrow to be a sunny day or a rainy day conditioned on that today is a sunny day or rainy day? Asked alternatively what is the average duration of a stretch of sunny days before the weather changes, or what is the average duration of a stretch of rainy days before the weather changes? What is the event rate of a weather change from a sunny day to a rainy day, or from a rainy day to a sunny day? To answer this question, we inspect $n$ stretches of sunny days that is $N$ days in total. Because the probability for weather change from sunny day to rainy day is 0.1, we expect $n\over N=0.1$ which means the average duration of a stretch of sunny days is $N\over n={1\over .1}=10$ days. Similarly, we estimate the average stretch of rainy days is ${1\over .2}=5$  days. The duration for a stretch of sunny days is longer because the probability for a weather change from sunny to rainy is smaller. What we’re talking about is the relationship of different random variable, a random variable with exponential distribution, a random variable with geometric distribution, and a random variable with Poisson distribution, and a random variable with binomial distribution. We’ll talk about those concepts later.

Predator-Prey Dynamics

The simplified weather system has very simple dynamics, and the reality is much more complicated. The predator-prey model, although still pretty simple, has found applications in ecology and many other areas. In this model we have two populations: the predator population and the prey population. This system evolves according to three events or “chemical reactions”: First, the prey individual in the system can multiply into two prey individuals. The rate (probability per unit time) for this event to happen on a specific prey individual is $c_1$. (The rate for the whole population is $c_1\times$ prey population.) Second, whenever a prey individual and a predator individual meet each other, the predator will turn this prey into a predator. This event will remove a prey individual from this system and introduce a predator into this system. The rate for this event is $c_2$. Third the predator individual will die. This event will remove a predator individual from the system. The rate for this event is $c_3$.

This type of system exists in reality, and people often observe that the prey population and predator population change periodically, and that the predator population follows the prey population. We will ask, why we have the periodical movement of the prey population and the predator population, how do we preserve the ecosystem, when the ecosystem is near collapsing and with what probability, and so on and so forth. With the dynamics of this system, we can proceed to reason about different ways to introduce intervention and change this system.

We defined a generative model of a dynamics system driven by three events — We defined a way to sample this system, just as we defined a way to sample the simplified weather system: At time zero, we have our initial state, which is comprised of the prey population and the predator population. Then we can sample the next event and change the the predator and prey populations according to the event. Then we sample the next event again, and change the state of the system again, and so on and so forth. Given a sequence of events, we will know the sequence of states of this system in terms of number of preys and the number of predators. Associated with each sequence of events, we also have a probability, which is the probability for this sequence of events to happen. In this sense, this predator prey model is pretty similar to the weather system that we have and the mathematical model. This is a model developed by a chemist.

We can define these dynamics of the system in term of Gillespie algorithm that generates a description, a probability measure of a stochastic process. We can proceed to sample this system. Then we often need to infer the evolution of the system from imperfect observations about this system. Then we can proceed to reason about this system and different interventions and engage data with simulation modeling. Our observation about the predator-prey system is imperfect because we cannot continuously track populations and often we can only track one of the two populations. Our reasoning about a predator-prey system could include: Whether this system is in danger of collapse, Whether it’s going to be a peak of population, whether it’s going to be a valley of population, or if we want to change the dynamics of the system in this or that way what we want to do.

Bales Interacting Process Analysis

Many times we cope with the data about people, and the dynamics about people are much more complicated. Bales interaction processes analysis model is a model that talks about how groups of people solve problems face to face. Let us consider Computer Supported Collaborative Ware (CSCW in short). In CSCW research, we use sensors to capture images, and audio signal, and motion signal, and other kinds of signal. From these signals, we are supposed to advise the group how this group can improve the performance of problem solving. In other words, we listen to a group of people speaking unknown language — we look at how they interact because we do not know what they talk about, and in the end we want to give the people some other advice about we can solve problems better. This is a tough problem. In order to solve this problem we will use simulation modeling to help us engage gigabytes of data collected from group discussion and generate reasoning.

Bales Interaction Process is one such model: For a group to solve problems, this group needs to involve different types of tasks —tasks such as asking for facts and giving facts, asking for new directions to explore and giving new directions, voting for opinions, and solving disputes. Different tasks will involve different features: If I give a fact, I may talk longer and the other people may be listening; If a group of person are vote for a direction, there will be a lot of overlapping in their speech signals. If there’s a lot of quarreling,  the pitch and the volume may be higher, and so on.

Hence although we listen some language that we don’t understand, with this model we can understand the dynamics, such as whether a person is giving a fact or whether there is agreement, disagreement going on, and whether a group of person is voting, and who and who are on one side, and who and who are on another side. Then in the end of discussion, although we still do not know the content of discussion, we can count, for example, how many facts have been given in this discussion, and how many voting processes are going on, and whether there are a lot of quarrellings or not, and who is on one side and who’s on other side, and so on and so forth. In the end, we can tabulate different activities in the discussion, and then this effectively shrink the search space from one gigabyte of data to perhaps one page of numbers, and being able to estimate the performance of this group through this one page of numbers will definitely be more reliable.

In this example, the weather system example and the predator prey model, we start from defining a probabilistic model driven by a set of events, proceed to infer the evolution of the system our observations about this system, finally reason about the statistics about this system and reason about introducing intervention. The inference problem is normally much harder, because we involve complex interactions, and we will talk about different ways to solve this problem.

Road Transportation Dynamics

Let us talk about simulation modeling of transportation engineering. Building a road or setting up tolls are not easy, because if things are implemented in the wrong way, there will be a lot of turmoil going on in a city. Also, building a road involves billions of dollars. So we have to first carefully reason about our plan through simulation modeling before we can really implement it.

Transportation engineers start simulation modeling from collecting data. They collect the road network, which is similar to the road network that you use in a GPS navigator. They also collect the population density and typical travel patterns, which are from the census data and travel surveys. In the census data we have different census tracts, and in each tract we have number of households, and the average incomes of the households, and population distribution, and so on. The transportation department calls different people and ask then different questions about daily travel such as: “What is the distance from where you live to where you work?”; “How many trips to you have, and what are the types of trips?”. From this kind of data, we can synthesize the typical trips of different people. Now we have the travel network, which supplies the transportation, and we have the travel demand, which is synthesized from the population surveys.

After we have those two different types of information, we proceed to put this data into a simulator. The simulator will simulate how people move out in the morning, how traffic jams form due to the interaction of many cars on the street, how traffic jams result in detours and increased the cost of commuters in terms of fuel, late arrival to the workspace, and so on. After the simulation, the individual agents in the multi-agent system will try to improve their travel plans this system just like we improve our trips from day to day in reality. The simulator will just go through this process again and again and again until the system reaches equilibrium. At this time, we know how the average people head out to work, and head back from the work, take lunch, and pick up kids, and so on.

This system evolution is defined by a sequence of events. If P is a person, and L is a location, which could be a road or a building, the events are $P L_1 \to P L_2$ with rate of happening being $c$. The events happen at different rates: If the road is long and the traffic is heavy, the rate of moving out of the road will be low; If one direction leads to a more popular place than another direction, there will be a higher probability for people to take the first direction. Again, we’re able to define the dynamics of people in terms of a stochastic process.

Now suppose that we can observe the trajectories of taxi cabs: whether a taxi cab moves slower than previously on a road, or whether there’re more taxi cabs on a road, and so on. We can proceed or engage reasoning in the same ways as we did before. Starting from the number and behavior of tracked vehicles in a road link, we can determine the total number of vehicles in the link by scaling and estimating traffic conditions. If we trace the origins and destinations of the estimated number of vehicles through the behavior of the simulator and fill any gaps with prior individual travel behaviors, we can extract information about the traffic at other road links. If we then iterate estimations between the traffic at links and the trip choices of simulated vehicles, we improve our estimation of both.

By observing how people move, we can infer not only road traffic, but also economical and social indices of large social systems. There are many sectors in our industry:  the education sector the service sector, the energy sector, the banking sector, and so on. People in different sectors move in different ways. Hence by studying how people move around and how people interact with each other, we might be able to estimate the GDP of a whole city, the regions with poverty, the distribution of different industrial sectors, and the formation of a city from the core city to a much larger city.

Nowadays data is ample, and the question is how you can engage the data in your modeling. For example, how can we infer the dynamic collaborator network from the DBLP data set, which is a publication database in the field of computer science? Why people collaborated in the past, who will collaborate in the future, and what will be the hot topics in a scientific field? For example, from the proximity data about computer science students captured with mobile phones, how do we figure out the typical paths through which cold and flu spread and organize campaigns to improve health? For example, from the D4D data set about the hourly locations and phone communications of 12 million people in Senegal, how can we make inferences about the road traffic, the big events at a specific places, the spreading of malaria, and so on.


The learning objective is that at the end of this semester, you will be able to simulate Markov processes. A Markov process is a process that, given the present, we do not have to worry about the past to make inferences about the future. You will be able to make inferences about these different Markov processes by combining simulation models with data, and you’ll be able to reason about dynamics of those processes: What is the average duration of a sunny day; What will be the first time for the prey population to be lower than some threshold? What will be the performance of a group? What is the probability for there to be a traffic jam at the certain places? What are the poorer areas, and what are the richer areas?

To achieve the learning objectives, we’ll talk about how do we simulate the Markov processes, and then we’ll talk about how do we make inference about Markov processes. We’ll talk about, we just mentioned exactly inference today, but most of the time, the processes are pretty complicated. We’ll have to use either MCMC, meaning a sampling or we will use some optimization technique to be able to find optimal approximate solution. Then we’ll talk about the properties of Markov processes, and then we will apply our knowledge to understand the different systems that you may encounter in the future. Most of the systems are systems about people. The tool that we will use in this course is R.

We plan to have four problem sets, and the first problem set will be out by the end of September the 11th, next Friday. I normally give two weeks for you to solve the problems. There will be four problem sets — But first, don’t worry, I think you will all get above B+; Second, if you have problems in finishing the problems, discuss with me. I think hands on experience is pretty important if we want to understand things. If we do not solve the problems.

At the end of the semester we will each submit a term paper, which could be either a literature review of some field that you are interested in related to machine learning and stochastic process, or some exploration about your own projects, or some preparation about the paper that you want to publish next.

Participation. In this semester, I will allocate 10% of the score in participation in the form of contributing  lecture notes to the course web log. I’ll finish the first few, you will do the rest, and I will put all notes together and share them with the whole class at the end of the semester, please register to the course web log so that you can contribute to lecture notes.

In summary, nowadays we have a lot of data, the industry is interested in big data, and academia is interested in big data as well. There are a lot of opportunities for us to understand social systems, chemical systems, biological systems, and so on from these data. The issue here is how we can engage those data and understand our systems and help people. In this course, we will study together on the mathematical tools that we might find useful in the future. I will see you in the next class.

Introduction [09/01/2015]