Particle filter, smoother and parameter learning


Let $X_{t}$ be the hidden states and $y_{t}$ be the observations of a discrete-time state-space model (Kalman filter and hidden Markov model) identified by a transition probability $P(X_{t+1}|X_{t})$ and an observation model $P(Y_{t}|X_{t})$, where $t=1,\cdots,T$. Particle filtering for making inferences about hidden states $X_{t}$ from observations $y_{1:t}$ is comprised of a predicting/sampling step to sample particle positionss $x_{t+1}^{1}\sim P(X_{t+1}|x_{t}^{i_{t}^{1}}),\cdots,x_{t+1}^{N}\sim P(X_{t+1}|x_{t}^{i_{N}^{1}})$ at time $t+1$ from the particles $x_{t}^{i_{t}^{1}},\cdots,x_{t}^{i_{t}^{N}}$ at time $t$, and an updating/resampling step to sample particle indices according to their likelihoods with respect to the observations $i_{t+1}^{1},\cdots,i_{t+1}^{N}\sim\mbox{Categorical}(P(y_{t+1}|x_{t+1}^{1}),\cdots,P(y_{t+1}|x_{t+1}^{N}))$, from the starting particle position $x_{0}^{1},\cdots,x_{0}^{N}=x_{0}$ and indices $i_{0}^{1}=1,\cdots,i_{0}^{N}=N$. The particle approximation of the filtering probability is $\hat{p}(x_{t}|y_{1,\cdots,t})=\frac{1}{N}\sum_{k}\delta(x_{t}^{i_{t}^{k}})\stackrel{N\to\infty}{\longrightarrow}p(x_{t}|y_{1,\cdots,t})$, and the particle approximation of the likelihood function is $\hat{p}(y_{1,\cdots,T})=\prod_{t}\hat{p}(y_{t}|y_{0,\cdots,t-1})=\prod_{t}\frac{1}{N}\sum_{k}\hat{p}(y_{t}|\xi_{t}^{k})\stackrel{N\to\infty}{\longrightarrow}p(y_{1,\cdots,T})$. The conditional probability of particle locations and indices from particle filtering $\psi^{N}$; the approximate likelihood of the of the particle locations $Z_n^N$ are given in the following. \begin{eqnarray} Z_{n}^{N} = \hat P(y_0,\cdots,y_n) & = & \left[{1\over N}\sum_{l=1}^{N}p(y_{0}|\xi_{0}^{l})\right]\cdot\prod_{k=1}^{n}\left[{1\over N}\sum_{l=1}^{N}p(y_{k}|\xi_{k}^{l})\right]\to Z_{n} = P(y_0)\cdot\prod_{k=1}^{n}p(y_{k}|y_{1,\cdots,k-1})\label{Z_approx}\\ \psi^{N}=P(\vec{\xi}\_{0},\cdots,\vec{\xi}\_{n},\vec{i}\_{0},\cdots,\vec{i}\_{n}|y_0,\cdots,y_n) & = & \left[\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\prod_{l=1}^{N}\frac{p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\sum_{l=1}^{N}p(y_{0}|\xi_{0}^{l})}\right]\cdot\prod_{k=1}^{n}\left[\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\prod_{l=1}^{N}\frac{p(y_{k}|\xi_{k}^{i_{k}^{l}})}{\sum_{l=1}^{N}p(y_{k}|\xi_{k}^{l})}\right]\label{psi}\\ \end{eqnarray} There are two common particle smoothing methods for making inferences about hidden states $X_{t}$ from observations $y_{1:T}$. The genealogical tree algorithm and the forward filtering backward sampling algorithm. Genealogical tree algorithm consists of iteratively tracing the ancestral line of particle positions for $x_{T}^{i_{T}^{1}},\cdots,x_{T}^{i_{T}^{N}}$: \[ x_{1,T}=x_{1}^{i_{1}^{i_{2}^{\cdot^{\cdot^{\cdot i_{T}^{k}}}}}},\cdots,x_{T,T}=x_{T-1}^{i_{T-1}^{i_{T}^{k}}},x_{T}^{i_{T}^{k}}\mbox{ where }k=1,\cdots,N. \] The ancestral lines including the particle positions at time $T$ form a particle approximation of the posterior distribution of state trajectory \[ \hat{p}(X_{1,\cdots,T}|y_{1,\cdots,T})=\frac{1}{N}\sum_{k}\delta_{x_{1,T}^{k},\cdots,x_{T,T}^{k}}(X_{1,\cdots,T})\stackrel{N\to\infty}{\longrightarrow}p(X_{1,\cdots,T}|y_{1,\cdots,T}). \] Hence the particles $x_{t,T}^{1},\cdots,x_{t,T}^{N}$ form an approximation of the smoothing probability $\hat{p}(X_{t}|y_{1,\cdots,T})=\frac{1}{N}\sum_{k}\delta_{x_{t,T}^{k}}(X_{t})\stackrel{N\to\infty}{\longrightarrow}p(X_{t}|y_{1,\cdots,T})$. A well known problem of the genealogical tree algorithm is that all particles at time $T$ will quickly converge to the same ancestors in their ancestral lines and subsequently the genealogical tree provides a poor approximation to the smoothing distribution $P(X_{t}|y_{1,\cdots,T})$ due to insufficient sample size. Forward filtering backward sampling Backward consists of resampling particle indices backward in time $j_{n}^{k}\sim\mbox{Categorical}(P(y_{n}|x_{n}^{1}),\cdots,P(y_{n}|x_{n}^{N}))$, and $j_{t}^{k}\sim\mbox{Categorical}(P(x_{t+1}^{j_{t+1}}|x_{t}^{1})P(y_{t}|x_{t}^{1}),\cdots,P(x_{t+1}^{j_{t+1}}|x_{t}^{N})P(y_{t}|x_{t}^{N}))$ for $t=n-1,\cdots,0$. The resulting particle trajectories $x_{1}^{i_{1}^{j_{1}^{k}}},\cdots,x_{T}^{i_{T}^{j_{T}^{k}}}$ form an approximation of the posterior distribution of the state trajectory and particles $x_{t,T}^{1},\cdots,x_{t,T}^{N}$ form an approximation of the smoothing probability. The conditional probability of particle locations and indices and a particle trajectory from backward resampling is consequently \begin{equation} k^{N}=P(\vec{\xi}\_{0}:\vec{\xi}\_{n},\vec{i}\_{0}:\vec{i}\_{n},j_{0}:j_{n}|y_{1:n})=\psi^{N}(\vec{\xi}\_{0},\cdots,\vec{\xi}\_{n},\vec{i}\_{0},\cdots,\vec{i}\_{n})\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\cdot\prod_{k=0}^{n-1}\frac{P(\xi_{t+1}^{j_{n+1}}|\xi_{t}^{j_{n}})P(y_{t}|\xi_{t}^{j_{n}})}{\sum_{l=1}^{N}P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{l})P(y_{t}|\xi_{t}^{l})}.\label{k} \end{equation} The genealogical tree algorithm and the forward filtering backward sampling algorithm both generate biased approimation of the posterior distribution of state trajectory. To remove the bias, we can resort to the Metropolis-Hastings algorithm or the importance sampling algorithm. The [Metropolis–Hastings algorithm][1] is a Markov chain Monte Carlo algorithm to draw a sequence of random samples from a probability distribution $P(x)$ usually on a very high dimensional space, provided that $P(x)$ is known up to a normalization factor. The lax requirement that $P(x)$ is known up to a normalization factor makes the Metropolis–Hastings algorithm particularly useful, because the normalization factor is often extremely difficult to calculate. The M-H algorithm proceeds by randomly moving around the sample space, often staying in high-density regions of $P(x)$ and occasionally exploring low-density regions according to the acceptance ratio $\alpha$ that compares the target distribution $P(x)$ and a proposal distribution. The <a>importance sampling</a> algorithm is a variance reduction technique to integrate a function against a probability distribution $P$, where the non-zero region of the function has a very small probability. To address the problem that a sample of distribution $P$ has insufficient number of points in the important region, we sample from another distribution $q$ that overweights this important region, and weight the sample points according to how the target distribution $P$ compares with the proposal distribution $q$. In either Metropolis-Hastings algorithm and importance sampling algorithm, we need to compare the proposal distribution and the target distribution. We proceed to show that the probability for us to get a particle trajectory from either the genealogical tree algorithm or the FFBS algorithm times the empirical probability $Z^N_n$ (Eq. $\ref{Z_approx}$ equals the joint probability of the particle trajectory and the observations, thus is proportional to the posterior probability of the particle trajectory according to the original dynamics. Hence we can combine the particle trajectories from multiple particle smoothing runs with weights $Z^N_n$. To find the probability of an ancestral line $(\cdots,\xi_{n-1}^{i_{n-1}^{i_n^1}},\xi_n^{i_n^1})$ from the genealogical tree algorithm, we can integrate/sum over all other particle positions and indices $i_n^l$ for $l=\{2,\cdots,n\}$, $\xi_n^l$ for $l=\{1,\cdots,n\}\backslash\{i_n^l\}$, $i_{n-1}^l$ for $l=\{1,\cdots,n\}\backslash\{ i_n^1 \}$, $\xi_{n-1}^l$ for $l=\{1,\cdots,n\}\backslash\{ i_{n-1}^{i_n^1} \}$, …. \begin{eqnarray*} & & \sum_{\mbox{except }i_{n}^{1},i_{n-1}^{i_{n}^{1}},\cdots;\xi_{n}^{i_{n}^{1}},\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}},\cdots}\psi^{N}Z_{n}^{N}\cdot N^{n+1}\\ & = & \sum_{\mbox{except }i_{n}^{1},i_{n-1}^{i_{n}^{1}},\cdots;\xi_{n}^{i_{n}^{1}},\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}},\cdots}\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\frac{\prod_{l=1}^{N}p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{0}|\xi_{0}^{l})\right)^{N-1}}\prod_{k=1}^{n}\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\\ & = & \cdots\int d\xi_{n}^{1},\cdots,\xi_{n}^{i_{n}^{1}-1},\xi_{n}^{i_{n}^{1}+1},\cdots,\xi_{n}^{N}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})\sum_{i_{n}^{2},\cdots,i_{n}^{N}}\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\\ & = & \cdots\int d\xi_{n}^{1},\cdots,\xi_{n}^{i_{n}^{1}-1},\xi_{n}^{i_{n}^{1}+1},\cdots,\xi_{n}^{N}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})p(y_{n}|\xi_{n}^{i_{n}^{1}})\cancel{\prod_{l=2}^{N}\frac{\sum_{i_{n}^{l}=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})}}\\ & = & \cdots p(y_{n}|\xi_{n}^{i_{n}^{1}})p(\xi_{n}^{i_{n}^{1}}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})\cancel{\prod_{l\ne i_{n}^{1}}\int d\xi_{n}^{l}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})}\\ & = & p(y_{n}|\xi_{n}^{i_{n}^{1}})p(\xi_{n}^{i_{n}^{1}}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})p(\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}}|\xi_{n-2}^{i_{n-1}^{i_{n-1}^{i_{n}^{1}}}})\cdots \end{eqnarray*} To find the probability of a backward sample $\xi_0^{j_0},\cdots,\xi_n^{j_n}$, we marginalize over all other particle positions and indices $\xi_n^l$ for $l\ne j_n$, $i_n^l$ for $l=1,\cdots,N$, $x_{n-1}^l$ for $l\ne j_{n-1}$, $i_{n-1}^l$ for $l=1,\cdots,n$ …. \begin{align*} & \sum_{\text{except }j_{1:n},\xi_{1:n}^{j_{1:n}}}k^{N}Z_{n}^{N}\\ = & \sum\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\frac{\prod_{l=1}^{N}p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{0}|\xi_{0}^{l})\right)^{N-1}}\prod_{k=1}^{n}\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{k}|\xi_{k}^{i_{k}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{k}|\xi_{k}^{l})\right)^{N-1}}\cdot\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\prod_{k=0}^{n-1}\frac{P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{j_{t}})P(y_{t}|\xi_{t}^{j_{t}})}{\sum_{l=1}^{N}P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{l})P(y_{t}|\xi_{t}^{l})}\\ = & \sum\cdots\sum_{i_{n}^{1:N},\xi_{n}^{\backslash j_{n}}}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\\ = & \sum\cdots p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})\prod_{l\ne j_{n}}\bcancel{\sum_{\xi_{n}^{\backslash j_{n}}}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})}\frac{\cancel{\prod_{l=1}^{N}\sum_{i_{n}^{l}=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}}{\cancel{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}}\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\cancel{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}}\\ = & \sum\cdots\sum_{i_{n-1}^{1:N},\xi_{n-1}^{\backslash j_{n-1}}}p(\xi_{n-1}^{l}|\xi_{n-2}^{i_{n-2}^{l}})\frac{\prod_{l=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n-1}|\xi_{n-1}^{l})\right)^{N-1}}\frac{P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})}{\sum_{l=1}^{N}P(\xi_{n}^{j_{n}}|\xi_{n-1}^{l})P(y_{n-1}|\xi_{n-1}^{l})}\cdot p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})P(y_{n}|\xi_{n}^{j_{n}})\\ = & \sum\cdots p(\xi_{n-1}^{j_{n-1}}|\xi_{n-2}^{i_{n-2}^{j_{n-1}}})\bcancel{\sum_{l\ne j_{n-1}}p(\xi_{n-1}^{l}|\xi_{n-2}^{i_{n-2}^{l}})}\cancel{\sum_{i_{n-1}^{j_{n}}=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{j_{n}}})p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})}\frac{\cancel{\prod_{l\ne j_{n}}\sum_{i_{n-1}^{l}=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{l}})}}{\cancel{\left(\sum\limits \_{l=1}^{N}p(y\_{n-1}|\xi_{n-1}^{l})\right)^{N-1}}}\frac{P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})}{\cancel{\sum_{l=1}^{N}P(\xi_{n}^{j_{n}}|\xi_{n-1}^{l})P(y_{n-1}|\xi_{n-1}^{l})}}\cdot P(y_{n}|\xi_{n}^{j_{n}})\\ = & \sum\cdots\cdot p(\xi_{n-1}^{j_{n-1}}|\xi_{n-2}^{i_{n-2}^{j_{n-1}}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n}|\xi_{n}^{j_{n}})\\ = & \rho(\xi_{0}^{j_{0}})P(y_{0}|\xi_{0}^{j_{0}})\prod_{t=1}^{n}P(\xi_{t}^{j_{t}}|\xi_{t-1}^{j_{t-1}})P(y_{t}|\xi_{t}^{j_{t}}). \end{align*} Hence the $l$-th particle trajectory from the $k$-th particle smoothing run ${^k}\xi{\_{1:n}^{j\_{1:n}^l}}$ has an importance weight ${^k}Z{^N_n}$. In other words, ${1\over K}\sum_{k=1}^K {1\over L_k}\sum_{l=1}^{L_k} \delta_{{^k}\xi{\_{1:n}^{j\_{1:n}^l}}}(X_{1:n})\cdot {^k}Z{^N_n}$ is an estimate of the posterior probability $P(X_{1:n}|y_{1:n})$. The Metropolis-Hastings algorithm to sample particle trajectories from multiple particle smoothing runs work in the following way. Basis: Set the current particle trajectories from particle smoothing ${^*}\xi_{1:n}^{j_{1:n}^l}={^1}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_1$ and keep ${^*}Z^N_n={^1}Z^N_n$. Induction: Run particle smoothing to get ${^k}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_k$ and keep ${^k}Z^N_n$. Accept the new trajectories with probability $\alpha=\min(1,{{^k}Z^N_n\over {^*}Z^N_n})$ and update the current trajectories ${^*}\xi_{1:n}^{j_{1:n}^l}={^k}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_k$ and ${^*}Z^N_n={^k}Z^N_n$.

Particle filter, smoother and parameter learning

Probability Distributions [09/08/2015]

In this lecture, we’ll talk about several interesting probability distributions. I assume that perhaps you have already heard of those distributions before, but what I’m going to tell you is how those distributions come into being. After you know how those distributions come into being, you will not feel those distributions as mysterious. Another message that I want to tell you is that what I find is that engineering students, including me many, many years ago, normally just pick a distribution and use this distribution for random, without thinking about what is the connotation of such probability distribution. I hope that after you know how those probability distributions that I talk about in this lecture come into being, you will think about how the other distributions come into being. Then it will help you reason about the distributions that you will work with in the future.

Distributions from repeated Bernoulli trials

Let us tossing a coin repeatedly. Whenever we toss a coin, we conduct a Bernoulli trial, so we have, as a result, we could have a head up or we could have a tail up. We assign probability $p$ to the event that we get a head up. We assign probability $1-p$ to the events that we get a tail up. We call a head up as a success. We call a tail up as a fail. If we toss this coin for many, many times then we get interesting random variables and distributions. Formally speaking, let us use random variable $X$ to denote the result of a Bernoulli trial. $X$ can take two values: $X$ takes 1 with probability $p$ ($P(X=1)=p$) and $X$ takes 0 with probability $1-p$ ($P(X=0)=1-0)$.

What is the mean value and variance of this Bernoulli trial?

The expected value of $X$ is $\mbox E(X)=1\times p+0\times(1-p)=p$ according to the definition of expected value. This agrees with out intuition: If we bet on the result of a coin toss — if we win \$1 for a head and and nothing for a tail and if the probability of getting a head is $p$, we expect to get \$ $n\cdot p$ after $n$ trials, and \$ $p$ after 1 trial.

The variance of $X$ according to definition is the expected value of the square deviation of the outcomes to their mean value, $$\mbox{Var}(X)=\mbox E\left(X-\mbox E(X)\right)^2=(1-p)^2\cdot p+(0-0)^2\cdot(1-0)=p(1-p)$$.

Interesting random variables and distributions will arise from repeated coin tosses. If we throw this coin for $n$ times, then the number of heads $X$ after $n$ tosses has a binomial distribution. The number of fails until the first success and including the first success has geometric distribution. The number of successes before a specific number of fails has a negative binomial distribution. We will explain why it is called negative binomial distribution. The time until the first success, if the rate of success is $\lambda$, and the probability of success in any infinitesimal time $\Delta t$ is $\lambda\times\Delta t$, has  exponential distribution. Geometric distribution is a discretized version of exponential distribution. Exponential distribution is the continuous version of geometric distribution. The time to the $k^\mbox{th}$ success has a Gamma distribution. Negative binomial distribution is the discrete version of gamma distributions, and gamma distribution is the continuous version of negative binomial distribution. The time to the $k^\mbox{th}$ success out of the $n$ successes happening in a fixed time has Beta distribution. If we throw a dice instead of a coin, the numbers of times we get faces 1-6 out of $n$ throws has multi-nominal distribution. As we conduct more and more experiments, the average will have first moment, and second moment, with higher moments all vanish. It has normal distribution.

Binomial Distribution

Let us find the probability of getting $k$ heads from $n$ tosses, which is the probability for $X=\sum_{i=1}^n X_i$ to be $k$, where $X_1, X_2,  \cdots X_n$ are the outcomes from a single toss. $X$ has binomial distribution with rate $p$ and number of trials $n$.

What is the probability that we get 2 heads out of 5 toss? Let us use $1_1$ and $0_1$ to denote a success and a failure respectively from the first toss — $X_1=1$ if we get $1_1$ and $X_1=0$ if we get $0_1$. A sequence $1_1, 1_2, 0_3, 0_4, 0_5$ of tosses has 2 successes, happing with probability $p\cdot p\cdot (1-p)\cdot (1-p)\cdot (1-p)$, where $p$ is the probability of getting $1$ and $1-p$ is the probability of getting $0$. Any permutation of the sequence $1_1, 1_2, 0_3, 0_4, 0_5$ has 2 successes, and there are $5!$ of them. The $5!$ permutations fall into ${5\choose 2}={5!\over 2!\cdot(5-2)!}$ bins, with each bin containing the $2!(5-2)!$ sequences that are the same with the subscript is dropped. For example sequence $1_1, 1_2, 0_3, 0_4, 0_5$ and $1_2, 1_1, 0_3, 0_4, 0_5$ are in the same bin and they are the same sequence after their subscripts are dropped. Hence the probability of getting 2 successes out of 5 tosses is ${5\choose 2}p^2(1-p)^{5-2}$. In general the probability of getting $k$ successes out of $n$ tosses is ${n\choose k}p^k(1-p)^{n-k}$, where $0\le k\le n$. Thus a binomial random variables is from counting the number of successes in repeated Bernoulli trials.

What is the expected value of a binomial random variable $X$?

\mbox{E}X &=& \mbox{E}\sum_{i=1}^n X_i = \sum_{i=1}^n \mbox{E} X_i = n\cdot p.

What is the variance of the binomial random variable? There is a easy way, and there is a hard way. The harder way is through definition.

\mbox{Var}(\sum_{i=1}^{n}X_{i}) & = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}-\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\
& = & \mbox{E}\left(\left(\sum_{i=1}^{n}X_{i}\right)^{2}-2\left(\sum_{i=1}^{n}X_{i}\right)\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)+\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\right)\\
& = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)^{2}-2\mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)+\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\
& = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)^{2}-\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\
& = & \sum_{i=1}^{n}\mbox{E}X_{i}^{2}+\sum_{i,j=1}^{n}\mbox{E}\left(X_{i}X_{j}\right)-\left(n\cdot p\right)^{2}\\
& = & n\cdot\left(p(1-p)+p^{2}\right)+n^{2}\cdot p^{2}-\left(n\cdot p\right)^{2}\\
& = & n\cdot p(1-p)

The easier way is to note that the Bernoulli trials $X_1,\cdots,X_n$ are i.i.d. or independent, and identically distributed and that the variance of the sum of independent random variables is the sum of the variances of those variables: $$\mbox{Var}(\sum_i^n X_i)=\sum_i^n\mbox{Var}(X_i)=n\cdot p\cdot(1-p)$$.

Let $X_1$ be a binomial random variable that counts the number of successes in $n_1$ independent Bernoulli trials, with the probability of success in each trials being $p$. Let$X_1$ be a binomial random variable that counts the number of successes in $n_2$ independent Bernoulli trials, with the probability of success in each trials also being $p$. It follows that $X_1+X_2$ is a binomial distribution  that counts the number of successes in $n_1+n_2$ independent Bernoulli trials, with the probability of success in each trials being $p$.

Multinomial Distribution

If we throw a dice instead of a coin, the outcome is a categorical distribution over the face numbers $\{1,2,3,4,5,6\}$. The probabilities for us to get each of the six numbers sum up to 1. In general, we can define a categorical distribution over all $K$ possible outcomes.

We can throw a dice repeatedly, and the numbers of occurrences of all 6 numbers is a multinomial distribution. What is the probability that out of $n$ independent throws, we get $1$ for $n_1$ times, get $2$ for $n_2$ times, get $3$ for $n_3$ times, and so on, supposing that in a single throw we get 1 with probability $p_1$, get $2$ with probability $p_2$, get $3$ with probability $p_3$ and so on? The probability of getting a single sequence is $$P(x_1, x_2,\cdots,x_n)=p_{x_1}\cdot p_{x_2}\cdots p_{x_n}=p_1^{\sum_{i=1}^n \delta(x_i=1)}\cdots p_K^{\sum_{i=1}^n \delta(x_i=K)}=p_1^{n_1}\cdots p_K^{n_K}$$. The probability of any of the $n!$ permutations of this sequence will be the same. The permutations of $n_k$ tosses of outcome $k$ for $k=1,\cdots,K$ will give us the same sequence. Hence the probability of getting $n_1$ $1$, $n_2$ $2$, $\cdots$, $n_K$ $K$ is $P(n_1,\cdots,n_K)={n! \over n_1!\cdots n_K!} p_1^{n_1}\cdots p_K^{n_K}$.

If we throw the dice twice, the outcome of the first throw is independent of the outcome of the second throw. But if we throw this dice for just once, the outcomes are correlated — if we get a 1, we cannot get a 2 out of a single throw.

Geometric Distribution

The number of failed trials until we get the first success has geometric distribution. What is the probability of getting $k-1$ fails until the first success? The probability mass function is $P(X=k)=(1-p)^{k-1}p$ — We first get $k-1$ fails with probability $1-p$ each time, then we get a success with probability $p$. What is the probability for $X$ to be less than or equal to $k$ (for $X$ to be $1,2,\cdots,k$)? The cumulative probability function is $P(X\le k) = 1-P(X>k) = 1-(1-p)^k$ — we get failed trials for the first $k$ times. The mean value and variance of a geometric random variable can be found through elementary math.

A random variable with geometric distribution has memory-less property, $P(X>s+t)=P(X>s)P(X>t)$, the probability of getting fails in the first $s+t$ trials equals the probability of getting fails in the first $s$ trials and the probability of failing in the next $t$ trails, and the probability of failing in the next $t$ trials equals the probability of failing in the first $t$ trials (by the time homogeneous of our coin). The geometric distribution is also the distribution of maximum entropy over distributions with support on non-negative numbers with mean value $\mbox{E}(X)$.

Suppose the arrival of the school bus has geometric distribution, then no matter what time we arrive at the bus station, we expect to wait the same amount of time to get the next bus, which is the average waiting time. If it is the case, we can use geometric distribution to model the arrival time of school bus. Otherwise, a geometric distribution is not a pretty good distribution for this type of model.

Negative Binomial Distribution

A negative binomial distribution is the distribution of the number of successes before we get $r$ fails, including the $r^\mbox{th}$ fail. A geometric distribution is a special case of negative binomial distribution. It is the distribution of the time until the first success, if we just reverse success and fail.

What is the probability mass function of the negative binomial distribution? What is the probability that we have $K$ successes until we get $R$ fails? We have conducted a series of experiments, and in a series of experiences, we have $K$ successes, and we have $R$ fails, so we have a specific sequence with $K$ successes and $R$ fails, where the probability is $P$ to the power of $K$, and $1-P$ to the power of $R$, because we have $K$ successes, and $R$ fails. The number of sequences that we have for us to have K successes, and R fails, and that the last one is always a fail, so we have already worked this out for many times.${\bf P}(X=k) = {k+r-1\choose k}p^{k}\cdot \left(1-p\right)^r$

That is how we get the probability mass function for negative binomial distribution. The expected value of negative binomial distribution is $pr\over 1-p$. This is intuitively.

Why it is called a negative binomial distribution? The reason is that it is related to the binomial distribution in a foreign sense. The probability for the number of successes before R fails. This is a random variable, because out of different samples, we will get different number of successes before R fails. The probability for this random variable to be less than $S$ is the same as the probability that we have more than $R$ successes after $R+S$ trials. Our successes after $R+S$ trials is another random variable, which is binomial random variable. What we are saying here, is that the probability for this negative binomial random variable to be less than $S$, is going to probability for a binomial random variable, which is the result of $R+S$ trials, to be greater than $R$. That is the relationship between the negative binomial distribution, and the binomial distribution, and it is the reason we call this negative binomial distribution.

$P(\mbox{number of success before r failure}\le s) = P(\mbox{more than r successes after r+s trails})$

Exponential Distribution

The next distribution is the exponential distribution. The exponential distribution is the limit of geometric distribution. Previous, in geometric distribution, we conduct one experiment in 1 unit time. Now here we see that we can conduct a fractal experiment, in a sense. We can conduct experiment in $\Delta t$ time, except that the probability of success in this $\Delta t$ is $\lambda\cdot\Delta t$, so we have a constant rate. The smaller the interval, the smaller the probability of success.

This is how we get the probability for $X$ to be greater than $t$ is going to be this one.

$P(X\ge t) = \lim_{\Delta t\to 0}(1-\Delta t\cdot\lambda)^{{t\over\Delta t}-1}\cdot (\Delta t\cdot\lambda) / \Delta t = \lambda \exp(-\lambda t)$

If we take the derivative, then we will get the probability density function for exponential distribution. This is how exponential distribution arise from repeatedly throw coins. Similarly, we can find out the expected value of the exponential distribution, and the variation of exponential distribution. Since we have already said that the exponential distribution is continuous to its counterpart of a geometric distribution, then the same memory layers property of a geometric distribution can be carried over to a random variable of exponential distribution. That is how we get the memory-less property here.

Another very interesting property about exponential distribution is, suppose we have two exponential distributions. $Y1$, and $Y2$. Then, the minimum of $Y1$ and $Y2$ is also exponential distribution. We call it memory-less property. This is a distribution that has the maximum entropy, in a sense that each time we start afresh, and the past really does not matter. Each time, we have the same rate of event happening. That is why exponential distribution occurs a lot in engineering, and in other branches of science.

$\textbf{memoryless property}$: ${\bf P}(X>s+t) = {\bf P}(X>s){\bf P}(X>t)$
Exponential distribution is the $\textbf{only}$ continuous distribution with memoryless property
$\min(Y_1,Y_2)$ has exponential distribution with rate $\lambda_1+\lambda_2$ if $Y_1$ and $Y_2$ have exponential distribution with rate $\lambda_1$ and $\lambda_2$ respectively


Poisson Distribution

The next distribution is the poisson distribution. The poisson distribution is the number of success in unit time. We talk about the time until the first fail, the time until the first success, the time until $R$ success, and so on. Here, we’re looking at a fixed interval. Then we’re looking at the number of successes in this interval. So, what is the distribution of this one? It also arises from taking the limit from binomial distribution. We take a unit time, and we divide this unit time in equal parts. The rate of success is $\lambda$. The probability of success of each equal part, each unequal part is going to be $\lambda\over n$.

If we take N to the limit, we will get this one.

$\lim_{n\to\infty}{n\choose k}({\lambda\over n})^k(1-{\lambda\over n})^{n-k}=e^{-\lambda}{\lambda^k\over k!}$

E to the power of minus lambda, which is a scaling factor to make this to be 1. We also have this one. The reason is that, this one, if we take N to the limit, it is some kind of exponential term. The reason is that this one … 1 plus X to the power of 1 over X, as X, 10 to zero, is going to be E. We can change the size to other things. We get little variation from this end result of E. That is how we get this one. I think the main tool for us to get this distribution from, by taking the limit as this one.

We can similarly find out the expected value of this Poisson distribution.

${\bf E}X = \lambda$

There’s a square here, and the variance of this Poisson distribution. A good property of Poisson distribution is that, if we have $Y1$ and $Y2$ both to be Poisson distribution with rates $\lambda1$ and $\lambda2$, then the sum of those two Poisson distributions is going to be another Poisson distribution, with the rates $\lambda1 + \lambda2$.

We also have splitting property, which is that if $X$ has a Poisson distribution with rates $\lambda$. We sample from a Poisson distribution, and we conduct Bernoulli trials over this unit time, corresponding to each success. We name these as success with probability of $P$, and we name these as a failure as probability of $1-P$. The result of the name success is still a Poisson distribution, but the rate is thinner, which is $P\cdot \lambda$. This is a thinning property of Poisson distribution.

Gamma Distribution

The next distribution is gamma distribution, which arises often as a distribution of exponential family. The gamma distribution is the time until the $k^{\rm th}$ success. We talk about the negative binomial distribution, right? The gamma distribution is continuous counterpart of negative binomial distribution. Because gamma distribution is a continuous random variable, so we have a probability density function. The way to find out the probability density function of a gamma distribution is by considering the two distributions.

First, let’s specify that $X$ is the time to the $k^{\rm th}$ event in a Poisson process, meaning that if we sample the time to the next event of success, then we sample the time to the next success, and so on, so forth. Let’s use $Y$ to count the number of events in the interval from zero to $x$. The probability for $X$ to be less than $x$ is a cumulative probability function of a gamma distribution. The time to the $k^{\rm th}$ success is less than $x$, meaning that, in the interval from zero to $x$, there will be greater than or equal to $K$ number of events. If we talk about greater or equal to $K$ number of events, we’re talking about a Poisson distribution, and we say it is a random variable of Poisson distribution to take $K+1$, and so on, and so forth.

If we write this out, we get this one.

$P(X\le x)=1- P(Y<k)=1 – \sum_{i=0}^{k-1}{(\lambda x)^k\over k!}\exp(-\lambda x)$

That has the probability for this gamma distribution to be less than $x$.Because we are talking about continuous random variable, we can take the derivative, and find out the probability density function. This is how we get the probability density function of a gamma distribution.

${d\over dx}P(X\le x)={\lambda\over (k-1)!}(\lambda x)^{k-1}\exp(-{\lambda x})$

Normal Distribution

All of those distributions, if you take to the limit, in a sense, is going to be normal distribution. The first moment is the expected value of random variable X, and the second moment is the expected value of X squared. The higher order moment is the expected value of X cubed, and so on, and so forth. What we know is, if we take the random variable to some sequence of random variable to some limit, then the higher order moments will all disappear, and we’ll end up with only having first moment, and the second moment, and that is how we get the normal distribution.

$f(x) = {1\over \sqrt{2\pi}\sigma}\exp(-{1\over 2}({x-\mu\over\sigma})^2)$
– ${\bf E}X =\mu$
– $\sigma^2 X = \sigma^2$

Central Limit Theorem

– Let $X_1,X_2,\dots$ be i.i.d. with $\bf EX_1=0$ and $\bf EX_1^2=\sigma^2<\infty$
– Then $\frac{S_n}{\sqrt n}=\frac{\sum_{i\le n}X_i}{\sqrt n}\to\cal N(0,\sigma^2)$

Proof: moments higher than 2 converges to 0

– $\phi(t)=\bf E e^{itX}=\phi(0)+\phi'(0)t+\frac{1}{2}\phi”(0)t^2+o(t^2)\to 1-\frac{\sigma^2 t^2}{2}+o(t^2)$ as $t\to 0$
– $\bf Ee^{\frac{itS_n}{\sqrt t}}=(\phi(\frac{t}{\sqrt n}))^n\to e^{-\frac{1}{2}\sigma^2t^2}$ as $n\to\infty$







Probability Distributions [09/08/2015]

PS2 Solutions

Kalman Smoothing and Parameter Estimation

We work backward here.

& P(x_{k}|y_{1,\dots,T})\\
= & \sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,T})P(x_{k+1}|y_{1,\dots,T})\\
= & \overset{P(x_{k}|x_{k+1}y_{1,\dots,T})=\frac{P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}{\sum_{x_{k}}P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}=P(x_{k}|x_{k+1}y_{1,\dots,k})}{\sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,k})P(x_{k+1}|y_{1,\dots,T})}\\
= & \sum_{x_{k+1}}\frac{P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}{\sum_{x_{k}}P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}\cdot P(x_{k+1}|y_{1,\dots,T})

Hence to estimate the distribution of $x_{k}$ from $y_{1,\dots,T}$ in Kalman smoothing, we first need to find the joint distribution $P(x_{k+1}|x_{k})P(x_{k}|y_{1,\dots,k})$ just after the prediction step to estimate the distribution of $x_{k+1}$ in Kalman filtering, condition the distribution of $x_{k}$ from predicted value of $x_{k+1}$, join with the distribution of $x_{k+1}$ in Kalman smoothing, and marginalize over the distribution of $x_{k+1}$ in Kalman smoothing:

\end{matrix}\right) & \sim\mathcal{N}\left(\left(\begin{matrix}\hat{x}_{k|k}\\
\end{matrix}\right),\left(\begin{matrix}P_{k|k} & P_{k|k}F_{k+1}^{T}\\
F_{k+1}P_{k|k} & F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}
X_{k|k}|x_{k+1|k} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\
& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)\\
X_{k|k} & =\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right)+\epsilon\\
& \mbox{where }\epsilon\sim\mathcal{N}(0,P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k})\\
X_{k|T} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(\hat{x}_{k+1|T}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\
& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}P_{k|T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}+P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)


  • innovation in smoothing is $\delta=\left(\hat{x}_{k+1|T} -\hat{x}_{k+1|k}\right)$
  • Covariance of innovation is $P_{k+1|k}=F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}$
  • Gain in smoothing is $G=P_{k|k}F_{k+1}^{T}P_{k+1|k}^{-1}$
  • Updated (a posterior) state estimate $\hat{x}_{k|T}=\hat{x}_{k|k}+G\delta$
  • Updated (a posterior) covariance matrix $P_{k|T}=P_{k|k}+G(P_{k+1|T}-P_{k+1|k})G^T$

Parameter Learning

We will estimate $F$, $H$, $Q$ and $R$ from observations $z_1,\dots z_T$ about the following system

  • $X_t = F X_{t-1} + w_t$, where $w_k\sim{\cal N}(0, Q)$ are i.i.d.
  • $Z_t = H X_t + v_t$, where $v_k\sim{\cal N}(0, R)$ are i.i.d. and independent of $w_{t’}$ for all $t’$

Likelihood of $X_{1,\dots,T},Z_{1,\dots,T}$

=&\left(2\pi\right)^{-\frac{d_X}{2}}\left|\Sigma_{0}\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)\right)\\
&\prod_{t=2}^{T}\left(2\pi\right)^{-\frac{d_X}{2}}\left|Q\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)\right)\cdot\\
&\prod_{t=1}^{T}\left(2\pi\right)^{-\frac{d_Z}{2}}\left|R\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)\right)

Expected log likelihood observing $Z_{1,\dots,T}$

\mathbb{E}_{X_{1,\dots,T}}\log P(X_{1,\dots,T},Z_{1,\dots,T})
=& {\rm constant} {-\frac{1}{2}}\log\left|\Sigma_{0}\right|-{1\over 2}\mathbb{E}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)+\\
&\sum_{t=2}^{T}{-\frac{1}{2}}\log\left|Q\right|-{1\over 2}\mathbb{E}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)+\\
&\sum_{t=1}^{T}{-\frac{1}{2}}\log\left|R\right|-{1\over 2}\mathbb{E}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)

Estimating state transition model $F$ and observation model $H$

\frac{\partial}{\partial F_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&Q^{-\mbox{T}}\sum\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}=Q^{-\mbox{T}}F\sum\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}\\
\frac{\partial}{\partial H_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0\Rightarrow&H=\left(\sum_{t=1}^{T}Z_{t}\mathbb{E}X_{t}^{\mbox{T}}\right)\left(\sum_{t=1}^{T}\mathbb{E}X_{t}X_{t}^{\mbox{T}}\right)^{-1}

Solution: write down an equation involving
$\left(\matrix{\frac{\partial\log P}{\partial F_{11}} & \dots & \frac{\partial\log P}{\partial F_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial\log P}{\partial F_{d1}} & \dots & \frac{\partial\log P}{\partial F_{dd}} }\right)$
and solve $F$, where $d$ is the dimensionality of $X$. Apply the same procedure to solve $H$.

Estimating variance of process noise $Q$ and observation noise $R$

\frac{\partial}{\partial Q_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&\left(T-1\right)Q-\sum_{t=2}^{T}\left(\mathbb{E}X_{t}X_{t}^{\mbox{T}}-F\mathbb{E}X_{t-1}X_{t}^{\mbox{T}}-\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}F^{\mbox{T}}+F\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}F^{\mbox{T}}\right)\stackrel{\mbox{set}}{=}0\\
\frac{\partial}{\partial R_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&R=\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}Z_{t}Z_{t}^{\mbox{T}}-H\mathbb{E}X_{t}Z_{t}^{\mbox{T}}-Z_{t}\mathbb{E}X_{t}^{\mbox{T}}H^{\mbox{T}}+H\mathbb{E}X_{t}X_{t}^{\mbox{T}}H^{\mbox{T}}\right)

Solution: Note that the determinant of a matrix is $|A| = \sum_{\sigma\in S_n}{\rm sgn}(\sigma)\prod_{k=1}^d A_{k\sigma_k}$. Hence $\left(\matrix{ \frac{\partial|A|}{\partial A_{11}} & \dots & \frac{\partial|A|}{\partial A_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial|A|}{\partial A_{d1}} & \dots & \frac{\partial|A|}{\partial A_{dd}}}\right)\cdot A^{\mbox{T}}=|A|\cdot I$ and $\frac{\partial\log|A|}{\partial A}=\frac{|A|}{|A|}A^{-T}$.

Schur complement

Our goal is to find inverse of the covariance matrix $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)$. In order to do that, we first use the 2nd column $\left(\matrix{\Sigma_{12}\cr\Sigma_{22}}\right)$ to Gaussian-eliminate $\Sigma_{21}$, which is equivalent to right multiply the covariance matrix with $\left(\matrix{I & 0 \cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$. We then use the resulting 2nd row $\left(\matrix{0 & \Sigma_{22}}\right)$ to eliminate $\Sigma_{12}$, which is equivalent to left multiply the covariance matrix with $\left(\matrix{ I & \Sigma_{12}\Sigma_{22}^{-1} \cr 0 & I}\right)$. We performed the following operation.

& \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)
\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)
\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)\\
= & \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr & I}\right)
\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12}\cr 0 & \Sigma_{22}}\right)\\
= & \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right).

If we move $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$ and $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)$ to the right hand side of the equation above, then $\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right)$ and the resulting $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)^{-1}$ to the left hand side of the equation, we get the following.

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)^{-1}
= & \left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)
\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)
\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

Application of Schur complement to find the conditional probability density function of a multivariate normal distribution. Let $x = \left(\matrix{X_1\cr X_2}\right)\sim \mathcal{N}\left(\left(\matrix{\mu_1\cr\mu_2}\right), \left(\matrix{\Sigma_{11} &\Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right) \right)$ be multivariate normal distribution, where $X_1$ and $X_2$ are vectors of dimension $p$ and $q$ respectively, and we want to find the conditional distribution of $X_1$ on $X_2=x_2$.

The probability density function of $X$ is

& f\left(\matrix{x_{1}\cr x_{2}}\right)\\
\propto & \exp\left(-\frac{1}{2}
\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)
\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1}
\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\
= & \exp\left(-\frac{1}{2}
\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)
\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)
\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)
\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)
\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\
=& \exp\left(\left(\begin{matrix}\left(x_{1}-\mu_{1}\right)^{T}-\left(x_{2}-\mu_{2}\right)^{T}\Sigma_{22}^{-1}\Sigma_{21} & \left(x_{2}-\mu_{2}\right)^{T}\end{matrix}\right)
\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)
\left(\matrix{x_{1}-\mu_{1}-\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right)\cr x_{2}-\mu_{2}}\right)\right).

Hence $X_{1}|X_{2}=x_{2}\sim\mathcal{N}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$.


Duality between maximum entropy / minimum divergence and maximum likelihood

1. Convexity of $f^∗(y)=\sup_x(y^\mbox{T}⋅x−f(x))$: We want to show that $f^*(\alpha_1*y_1+\alpha_2*y_2)\le \alpha_1*f^*(y_1)+\alpha_2*f^*(y_2)$ for any $0\le\alpha_1,\alpha_2\le 1$ with $\alpha_1+\alpha_2=1$, $y_1$ and $y_2$. In fact we have

f^{*}(\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2}) & = & \sup_{x}((\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2})^{\mbox{T}}x-f(x))\\
& = & \sup_{x}(\alpha_{1}(y_{1}^{\mbox{T}}x-f(x))+\alpha_{2}(y_{2}^{\mbox{T}}x-f(x)))\\
& \le & \alpha_{1}\sup_{x_{1}}((y_{1}^{\mbox{T}}x-f(x)))+\alpha_{2}\sup_{x_{2}}((y_{2}^{\mbox{T}}x-f(x)))\\
& = & \alpha_{1}f^{*}(y_{1})+\alpha_{2}f^{*}(y_{2})\\

2. If $f(x)$ is differentiable in $x$, the supremum $\sup_x y^\mbox{T}\cdot x – f(x)$ is attained at $x^*$ with ${\partial\over\partial x}=0$, i.e., $y-f'(x^*)=0$ or $y$ is the derivative of $f$ at point $x^*$. By definition of supremum, $y^\mbox{T}\cdot x-f(x)\le f^*(y)$. We call $(x^*, f'(x^*))$ Legendre conjugate pair.

3. Double Legendre-Fenchel dual $f^{**}(x)$ as the Legendre-Fenchel dual of $f^*(y)$ is convex as the result of 1. The double Legendre-Fenchel dual of a convex function is itself. Suppose $g(x)\ge f^{**}(x)$ is a convex function below $f$. Then $\max(f^{**},g)$ is a convex function below $f$. Hence double Legendre-Fenchel dual of a function is the maximal convex function below.

4. Let $x=(x_1,x_2,\cdots)$ be the mean parameters of categorical distribution. Then $\phi(x)=x^\mbox{T}\cdot\log x=(\log x)^\mbox{T} \cdot x$ is the entropy of this distribution, $\triangledown \phi(x) = 1+\log(x)$, and
d_{\phi}(y,x) & = & \phi(y)-\phi(x)-\triangledown\phi(x)\cdot(y-x)\\
& = & y^{\mbox{T}}\cdot\log y-x^{\mbox{T}}\cdot\log x-(1+\log x)^{\mbox{T}}(y-x)\\
& = & y^{\mbox{T}}\cdot\log y-y^{\mbox{T}}\cdot\log x\\
& = & d_{\mbox{KL}}(x,y)

5. Since $\phi$ is convex, $\phi(\alpha\cdot x+(1-\alpha)\cdot y)\le\alpha\phi(x)+(1-\alpha)\phi(y)$ for $0\le \alpha\le 1$, or
\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{(1-\alpha)(\phi(y)-\phi(x))}{(1-\alpha)(y-x)}=\frac{\phi(y)-\phi(x)}{(y-x)},\\
\phi'(x)=\lim_{\alpha\to1}\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{\phi(y)-\phi(x)}{(y-x)},\\
d_{\phi}(y,x)=\phi(y)-\phi(x)-\phi'(x)\cdot(y-x) & \ge & 0.

6. For $0\le \alpha_1,\alpha_2\le 1$ and $\alpha_1+\alpha_2=1$,
d_{\phi}(\alpha_{1}y_{1}+\alpha_{2}y_{2},x) & = & \phi(\alpha_{1}y_{1}+\alpha_{2}y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\
& \le & \alpha_{1}\phi(y_{1})+\alpha_{2}\phi(y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\
& = & \alpha_{1}\left(\phi(y_{1})-\phi(x)-\phi'(x)(y_{1}-x)\right)+\alpha_{2}\left(\left(\phi(y_{2})-\phi(x)-\phi'(x)(y_{2}-x)\right)\right)\\
& = & \alpha_{1}d_{\phi}(x,y_{1})+\alpha_{2}d_{\phi}(x,y_{2}).

d_{\phi_{1}+\phi_{2}}(x,y) & = & (\phi_{1}+\phi_{2})(y)-(\phi_{1}+\phi_{2})(x)-(\phi_{1}+\phi_{2})(x)\cdot(y-x)\\
& = & \cdots=d_{\phi_{1}}(x,y)+d_{\phi_{2}}(x,y)

8. $\{x|d_\phi(x,u)=d_\phi(x,v)\}$ is a hyper plane: $d_{\phi}(x,u)=\phi(x)-\phi(u)+\phi'(u)(x-u)=\phi(x)-\phi(v)+\phi'(v)(x-v)=d_{\phi}(x,v)$. Rearranging the terms $\phi(u)-\phi(v)+\phi'(u)\cdot u-\phi'(v)\cdot v-(\phi'(u)-\phi'(v))\cdot x=0$.

9. We expand the expectationon $\mbox{E}_{X}d_{\phi}(X,u)=\mbox{E}\left(\phi(X)-\phi(u)-\phi'(u)(X-u)\right)=\mbox{E}\phi-\phi(u)-\phi'(u)(\mbox{E}X-u)$, and set its partial derivative over $u$ to be 0: $\frac{\partial}{\partial u}\mbox{E}_{X}d_{\phi}(X,u)=-\phi'(u)-\phi”(u)(\mbox{E}X-u)+\phi'(u)=0$. Since $\phi$ is convex, $\phi”\ge0$ and $u=\mbox{E}X$ minimizes expected K-L divergence.

10. $\psi(\theta)$ is convex-conjugate of $\phi(x)$, and $(\theta_1, x_1)$ and $(\theta_2,x_2)$ are conjugate pairs.

d_{\phi}(x_{2},x_{1}) & = & \phi(x_{2})-\phi(x_{1})-\phi'(x_{1})(x_{2}-x_{1})\\
& = & \theta_{2}^{\mbox{T}}\cdot x_{2}-\psi(\theta_{2})-\left(\theta_{1}^{\mbox{T}}\cdot x_{1}-\psi(\theta_{1})\right)-\theta_{1}(x_{2}-x_{1})\\
& = & \psi(\theta_{1})-\psi(\theta_{2})-x_{2}^{\mbox{T}}\left(\theta_{2}-\theta_{1}\right)\\
& = & d_{\psi}\left(\theta_{1},\theta_{2}\right).


H(q)-\sum_{i}\alpha_{i}\left(E_{q}(f_{i})-E_{\tilde{p}}(f_{i})\right) & \leftarrow & \sum_{k}q_{k}\log q_{k}-\sum_{i}\alpha_{i}\left(\sum_{k}f_{i}(x_{k})q_{k}-E_{\tilde{p}}(f_{i})\right)\\
\frac{\partial}{\partial q_{k}} & = & 1+\log q_{k}-\sum_{i}\alpha_{i}f_{i}(x_{k})\stackrel{\mbox{set}}{=}0\\
\Rightarrow q_{k} & \propto & \exp\left(\sum_{i}\alpha_{i}f_{i}(x_{k})\right)\\
& & \mbox{take to the limit.}

12. Given $q(x)=\exp(\theta\cdot g(x)-\Psi(\theta))$, $\log q(x)=\theta\cdot g(x)-\Psi(\theta)$. Setting partial derivative of log likelihood over parameter to 0: $\frac{\partial}{\partial\theta}\sum_{k}(\theta\cdot g(x^{(k)})-\Psi(\theta))=\sum_{k}g(x^{(k)})-N\cdot\frac{\partial}{\partial\theta}\Psi(\theta)\stackrel{\mbox{set}}{=}0$, we get $\mu=\frac{1}{N}\sum_{k=1}^{N}g(x^{(k)})$.

13. To find $\Psi^{*}(\mu)=\sup_{\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)$, we take the derivative of $\mu\cdot\theta-\Psi(\theta)$ over $\theta$: $$\frac{\partial}{\partial\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)=\mu-\underbrace{\frac{\sum_{x}f(x)\cdot\exp(\theta\cdot f(x))}{\sum_{x}\exp\left(\theta\cdot f(x)\right)}}_{\triangledown_{\theta}\Psi=\mbox{E}(f(x))=\mu}\stackrel{\mbox{set}}{=}0.$$
Hence $\mu=\triangledown_{\theta}\Psi$, and $$\Psi^{*}(\mu)=\theta\cdot\triangledown_{\theta}\Psi-\Psi(\theta)=\mbox{E}(\frac{\theta\cdot f(x)}{\Psi(\theta)})=\sum_{x}p(x;\theta)\log p(x;\theta)=-H(\theta).$$

Sample from finite state continuous time Markov process

1. There are 2 events in this system: weather change from sunny to rainy, and from rainy to sunny. Let the rate of the first event be $\alpha$ per day, and the rate of the latter event be $\beta$ per day. The number of events in $\Delta t$ days will be $\alpha\cdot\Delta t$ and $\beta\cdot\Delta t$ for $\Delta t$ small enough. This system is an inhomogeneous Poisson process.

Using first order approximation, the state transition kernel of one-day and one-hour is the following where $\Delta t$ takes value 1 and $1\over 24$ respectively,
\end{matrix}\left(\begin{matrix}1-\alpha & \alpha\\
\beta & 1-\beta
\end{matrix}\right)\cdot\Delta t.$$

Refer lecture slides of 09/01 for sampling in R.

2. Refer lecture slides of 09/01 for EM in R.

3. If we do not have observations, we take $P(Y_t|X_t)=1$. In other words, we do not update latent state distribution.

4. Refer stochastic kinetic model algorithm of 10/01/2015.

PS2 Solutions

Projects & Project ideas

In this post I collect together projects and ideas from the class and from myself. My criteria is that (a) a project should be a good complement to the content of this course, (b) can contribute to your career development, (c) are well defined and can be well managed and (d) can be finished in 4 weeks with 5 hours per week.

  • Review data sets that track the locations and proximity of people with mobile phones and other types of sensors: Where can those data sets be downloaded? What are the works on each data set? What are the interesting findings, innovative methodology and significant applications of these data sets? Sum of 10-based logarithm of the data size (in bytes) >100, references > 50 from most cited works to least cited works. Tell a cohesive story about these data sets.
  • Review social media data sets: Where can those data sets be downloaded? What are the works on each data set? What are the interesting findings, innovative methodology and significant applications of these data sets? Sum of 10-based logarithm of the data size (in bytes) >100, references > 50 from most cited works to least cited works. Tell a cohesive story about these data sets.
  • Review one of the open source simulation softwares: Demonstrate how to use the software by walking through one example, show the architecture of the software (modules and their interactions, entry points, key modules, etc.), show how to dump events or system events from the software. Those software could include UrbanSim, MATSim, TranSIMS, MetroPolis, STEM,  etc.
  • Define the dynamics of network evolution through a set of events, simulate network evolution, make inferences from observations using either variation method or MCMC. Hint: we can follow the examples in the problem sets, work with fewer events, no need to work with real world data.
  • Define the dynamics of diffusion through a set of events, simulate network evolution, make inferences from observations using either variation method or MCMC. Hint we can follow the examples in the problem sets. Hint: we can follow the examples in the problem sets, work with fewer events, no need to work with real world data.
  • Identify a real world data, postulate the dynamics, and conduct inference using one of the algorithms discussed in the lectures.
  • more to come….
Projects & Project ideas

CSE675 References

 1. Simulation of Markov Processes

1.1 What is simulation modeling and why should we care? Review of probability theory.

  • Stochastic modelling for systems biology [1], Chapter 1: What is simulation modeling? Why simulation modeling in systems biology?
  • The Big Book of Simulation Modeling: Multimethod Modeling with AnyLogic 6 [2], Chapter 1: What is simulation modeling? Why simulation modeling in studying complex social systems?
  • Generative Social Science: Studies in Agent-Based Computational Modeling (Princeton Studies in Complexity) [3], Chapter 1, The Generativist Manifesto, we do not understand “emergence” if we cannot construct it using agent-based modeling.
  • Interaction process analysis; a method for the study of small groups [4]: Social psychology, while seemingly too complicated to be modeled, has models

1.2 Where are the probability distributions (Bernoulli trial, binomial, multinomial, order statistics, exponential, Poisson, Gaussian and more) from?

  • Stochastic Simulation [5]: sampling techniques
  • Simulation Modeling and Analysis [6]: sampling techniques
  • Probabilities, Statistics and Data Modeling: An interesting e-book on where different probability distributions are originated, estimation and statistical testing, published by AI ACCESS ( Unfortunately this book is no longer available. I will share chapters later.

1.3 What are the random processes we often encounter? How do we sample from them?

  • Stochastic modelling for systems biology [1], chapter 6 on relating microscopic molecule-level interactions and macroscopic dynamics, chapter 8 on approximate algorithms.
  • Financial Modeling with Markov Jump Processes [7], on simulating Markov jump processes.

2. Making Inferences with Markov Processes

2.1 Exact inference with message passing and expectation maximization with applications to hidden Markov models and Kalman filter

  • A new approach to linear filtering and prediction problems [8], this is where Kalman filter was first brought forth by Kalman
  • Kalman filter on Wikipedia.
  • Extended Kalman filter is first order approximation to the state transition kernel when it is non-linear. Unscented Kalman filter is reported to out-perform extended Kalman filter.
  • Hidden Markov model: formulation, forward-backward method for latent state inference, EM algorithm for parameter estimation, Viterbi decoding algorithm to find maximum likelihood latent state path.

2.2 Legendre transform, K-L divergence, with applications to Markov processes that defy exact inference

  • Graphical Models, Exponential Families, and Variational Inference [9]: the best theoretical treatment of this topic, with loop belief propagation and expectation propagation discussed in chapter 4, mean field method in chapter 5 and parameter learning in chapter 6.
  • Factorial Hidden Markov Models [10]: This is  where Ghahramani and Jordan first applied mean field method to make inferences with graphical models.
  • Expectation Propogation for Approximate Inference in Dynamic Bayesian Networks [11]: duality between minimizing Bethe variation and minimizing negative log likelihood.
  • Expectation Propagation for approximate Bayesian inference [12]: This is where EP was first brought forth by Minka.

2.3 Markov chain Monte Carlo, with applications to Markov processes that defy exact inference

  • Stochastic modelling for systems biology [1]: chapter 9 on Gibbs sampling and Metropolis-Hastings sampling, chapter 10 on sampling based inference of stochastic kinetic models.

3. Complex Systems identified by Markov Processes

3.1 Evolution of social networks: random graph model, exponential random graph model, network formation game

  • Exponential Random Graph Models for Social Networks [13]: ERGM simulation, inference, parameter learning, applications; Temporal ERGM.
  • Game theory models of network formation: A strategic model of social and economic networks [14];The evolution of social and economic networks [15].
  • statnet, a suite of software packages for network analysis that implement recent advances in the statistical modeling of networks.
  • Physicists models on network evolution: On Random Graphs [16], On the Evolution of Random Graphs [17], Collective dynamics of ‘small-world’networks [18], Small worlds: the dynamics of networks between order and randomness [19], Emergence of scaling in random networks [20].
  • Real world network formation: Reality Commons data set and a network tracked with personal mobile phones [21].

3.2 Diffusion in social networks: opinion dynamics, culture dynamics & language dynamics

  • Statistical Physics of Social Dynamics [22]: Opinion dynamics, formation of languages and cultures, etc.
  • Interacting Particle Systems [23].
  • Agent-Based Models (Quantitative Applications in Social Sciences) [24]
  • NetLogo
  • Diffusion in real world: The spread of obesity in a large social network over 32 years [25] …

3.3 Urban dynamics: formation of cities, road transportation dynamics

  • Cities and complexity: understanding cities with cellular automata, agent-based models, and fractals [26]
  • Urban Dynamics [27]: book on the interaction between different industrial sections using system dynamics approach.
  • TRANSIMS: Transportation analysis and simulation system [28]
  • MATSIM-T: Aims, approach and implementation [29]

4. Stochastic process theory

4.1 Brownian motion process, Ito process

4.2 Markov jump process, Levy process

4.3 Non-equilibrium statistical mechanics


[1] D. J. Wilkinson, Stochastic modelling for systems biology, Boca Raton, FL: Taylor & Francis, 2006.
Title = {Stochastic modelling for systems biology},
Author = {Darren James. Wilkinson},
Publisher = {Taylor \& Francis},
Year = {2006},
Address = {Boca Raton, FL},
Date-added = {2012-08-11 18:32:17 +0000},
Date-modified = {2012-08-11 18:32:17 +0000}
[2] A. Borshchev, The big book of simulation modeling: multimethod modeling with anylogic 6, AnyLogic North America, 2013.
Title = {The Big Book of Simulation Modeling: Multimethod Modeling with AnyLogic 6},
Author = {Borshchev, Andrei},
Publisher = {AnyLogic North America},
Year = {2013},
Date-added = {2015-09-14 04:04:43 +0000},
Date-modified = {2015-09-14 04:04:43 +0000}
[3] J. M. M. Epstein, Generative social science: studies in agent-based computational modeling (princeton studies in complexity), Princeton University Press, 2007.
Title = {Generative Social Science: Studies in Agent-Based Computational Modeling (Princeton Studies in Complexity)},
Author = {Joshua M. M. Epstein},
Publisher = {Princeton University Press},
Year = {2007},
Date-added = {2012-07-23 15:03:41 +0000},
Date-modified = {2012-07-23 15:05:32 +0000}
[4] R. F. Bales, “Interaction process analysis; a method for the study of small groups.,” , 1950.
Title = {Interaction process analysis; a method for the study of small groups.},
Author = {Bales, Robert F},
Year = {1950},
Publisher = {Addison-Wesley}
[5] B. D. Ripley, Stochastic simulation, John Wiley & Sons, 2009, vol. 316.
Title = {Stochastic simulation},
Author = {Ripley, Brian D},
Publisher = {John Wiley \& Sons},
Year = {2009},
Volume = {316}
[6] D. W. Kelton and A. M. Law, Simulation modeling and analysis, McGraw Hill Boston, 2000.
Title = {Simulation modeling and analysis},
Author = {Kelton, W David and Law, Averill M},
Publisher = {McGraw Hill Boston},
Year = {2000}
[7] P. Tankov, Financial modelling with jump processes, Boca Raton, Fla.: Chapman & Hall/CRC, 2004.
Title = {Financial modelling with jump processes},
Author = {Peter Tankov},
Publisher = {Chapman \& Hall/CRC},
Year = {2004},
Address = {Boca Raton, Fla.},
Date-added = {2012-08-11 18:32:17 +0000},
Date-modified = {2012-08-11 18:32:17 +0000}
[8] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of fluids engineering, vol. 82, iss. 1, pp. 35-45, 1960.
Title = {A new approach to linear filtering and prediction problems},
Author = {Kalman, Rudolph Emil},
Journal = {Journal of Fluids Engineering},
Year = {1960},
Number = {1},
Pages = {35--45},
Volume = {82},
Publisher = {American Society of Mechanical Engineers}
[9] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and trends in machine learning, vol. 1, iss. 1-2, pp. 1-305, 2008.
Title = {Graphical Models, Exponential Families, and Variational Inference},
Author = {Martin J. Wainwright and Michael I. Jordan},
Journal = {Foundations and Trends in Machine Learning},
Year = {2008},
Number = {1-2},
Pages = {1-305},
Volume = {1},
Bibsource = {DBLP,},
Date-added = {2012-08-11 18:32:17 +0000},
Date-modified = {2012-08-11 18:32:17 +0000},
Ee = {}
[10] Z. Ghahramani and M. I. Jordan, “Factorial hidden markov models,” Machine learning, vol. 29, iss. 2-3, pp. 245-273, 1997.
Title = {Factorial Hidden Markov Models},
Author = {Zoubin Ghahramani and Michael I. Jordan},
Journal = {Machine Learning},
Year = {1997},
Number = {2-3},
Pages = {245-273},
Volume = {29},
Bibsource = {DBLP,},
Date-added = {2012-08-11 18:32:17 +0000},
Date-modified = {2012-08-11 18:32:17 +0000},
Ee = {}
[11] T. Heskes and O. Zoeter, “Expectation propogation for approximate inference in dynamic bayesian networks,” in Uai, 2002, pp. 216-223.
Title = {Expectation Propogation for Approximate Inference in Dynamic Bayesian Networks},
Author = {Tom Heskes and Onno Zoeter},
Booktitle = {UAI},
Year = {2002},
Pages = {216-223},
Bibsource = {DBLP,},
Crossref = {DBLP:conf/uai/2002},
Date-added = {2012-10-06 08:56:47 +0000},
Date-modified = {2012-10-06 08:56:47 +0000},
Ee = {{\&}smnu=2{\&}article_id=863{\&}proceeding_id=18}
[12] T. P. Minka, “Expectation propagation for approximate bayesian inference,” in Uai, 2001, pp. 362-369.
Title = {Expectation Propagation for approximate Bayesian inference},
Author = {Thomas P. Minka},
Booktitle = {UAI},
Year = {2001},
Pages = {362-369},
Bibsource = {DBLP,},
Crossref = {DBLP:conf/uai/2001},
Date-added = {2012-10-06 08:54:23 +0000},
Date-modified = {2012-10-06 08:54:23 +0000},
Ee = {{\&}smnu=2{\&}article_id=120{\&}proceeding_id=17}
[13] G. Robins, “Exponential random graph models for social networks,” Handbook of social network analysis. sage, 2011.
title={Exponential random graph models for social networks},
author={Robins, Garry},
journal={Handbook of Social Network Analysis. Sage},
[14] M. O. Jackson and A. Wolinsky, “A strategic model of social and economic networks,” Journal of economic theory, vol. 71, iss. 1, pp. 44-74, 1996.
title={A strategic model of social and economic networks},
author={Jackson, Matthew O and Wolinsky, Asher},
journal={Journal of economic theory},
[15] M. O. Jackson and A. Watts, “The evolution of social and economic networks,” Journal of economic theory, vol. 106, iss. 2, pp. 265-295, 2002.
title={The evolution of social and economic networks},
author={Jackson, Matthew O and Watts, Alison},
journal={Journal of Economic Theory},
[16] P. ERDdS and A. R&WI, “On random graphs i,” Publ. math. debrecen, vol. 6, pp. 290-297, 1959.
title={On random graphs I},
author={ERDdS, P and R\&WI, A},
journal={Publ. Math. Debrecen},
[17] P. Erd6s and A. Rényi, “On the evolution of random graphs,” Publ. math. inst. hungar. acad. sci, vol. 5, pp. 17-61, 1960.
title={On the evolution of random graphs},
author={Erd6s, Paul and R{\'e}nyi, A},
journal={Publ. Math. Inst. Hungar. Acad. Sci},
[18] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’networks,” Nature, vol. 393, iss. 6684, pp. 440-442, 1998.
title={Collective dynamics of ‘small-world’networks},
author={Watts, Duncan J and Strogatz, Steven H},
publisher={Nature Publishing Group}
[19] D. J. Watts, Small worlds: the dynamics of networks between order and randomness, Princeton university press, 1999.
title={Small worlds: the dynamics of networks between order and randomness},
author={Watts, Duncan J},
publisher={Princeton university press}
[20] A. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, iss. 5439, pp. 509-512, 1999.
title={Emergence of scaling in random networks},
author={Barab{\'a}si, Albert-L{\'a}szl{\'o} and Albert, R{\'e}ka},
publisher={American Association for the Advancement of Science}
[21] W. Dong, B. Lepri, and A. Pentland, “Modeling the co-evolution of behaviors and social relationships using mobile phone data,” in Mum, 2011, pp. 134-143.
Title = {Modeling the co-evolution of behaviors and social relationships using mobile phone data},
Author = {Wen Dong and Bruno Lepri and Alex Pentland},
Booktitle = {MUM},
Year = {2011},
Pages = {134-143},
Bibsource = {DBLP,},
Crossref = {DBLP:conf/mum/2011},
Date-added = {2012-08-11 18:32:17 +0000},
Date-modified = {2012-08-11 18:32:17 +0000},
Ee = {}
[22] C. Castellano, S. Fortunato, and V. Loreto, “Statistical physics of social dynamics,” Reviews of modern physics, vol. 81, iss. 2, pp. 591-646, 2009.
Title = {Statistical physics of social dynamics},
Author = {Claudio Castellano and Santo Fortunato and Vittorio Loreto},
Journal = {Reviews of Modern Physics},
Year = {2009},
Number = {2},
Pages = {591-646},
Volume = {81},
Date-added = {2012-07-30 12:17:08 +0000},
Date-modified = {2012-07-30 12:24:17 +0000}
[23] T. Liggett, Interacting particle systems, Springer Science & Business Media, 2012, vol. 276.
title={Interacting particle systems},
author={Liggett, Thomas},
publisher={Springer Science \& Business Media}
[24] N. Gilbert, Agent-based models (quantitative applications in social sciences), Sage Publications, Inc., 2007.
Title = {Agent-Based Models (Quantitative Applications in Social Sciences)},
Author = {Nigel Gilbert},
Publisher = {Sage Publications, Inc.},
Year = {2007},
Date-added = {2012-07-23 15:00:08 +0000},
Date-modified = {2012-07-30 15:33:34 +0000}
[25] N. A. Christakis and J. H. Fowler, “The spread of obesity in a large social network over 32 years,” New england journal of medicine, vol. 357, iss. 4, pp. 370-379, 2007.
title={The spread of obesity in a large social network over 32 years},
author={Christakis, Nicholas A and Fowler, James H},
journal={New England journal of medicine},
publisher={Mass Medical Soc}
[26] M. Batty, Cities and complexity: understanding cities with cellular automata, agent-based models, and fractals, Mit Press, 2007.
Title = {Cities and Complexity: Understanding Cities With Cellular Automata, Agent-Based Models, and Fractals},
Author = {Batty, M.},
Publisher = {Mit Press},
Year = {2007},
Bdsk-url-1 = {},
Date-added = {2012-07-24 01:21:51 +0000},
Date-modified = {2012-07-24 01:21:51 +0000},
ISBN = {9780262524797},
Url = {}
[27] J. W. Forrester, Urban dynamics, Massachusetts Institute of Technology, 1970, vol. 11.
Title = {Urban dynamics},
Author = {Forrester, Jay W},
Publisher = {Massachusetts Institute of Technology},
Year = {1970},
Number = {3},
Volume = {11},
Date-added = {2015-10-08 14:53:17 +0000},
Date-modified = {2015-10-08 14:53:42 +0000},
Journal = {IMR; Industrial Management Review (pre-1986)},
Keywords = {modeling},
Pages = {67}
[28] L. Smith, R. Beckman, and K. Baggerly, “Transims: transportation analysis and simulation system,” Los Alamos National Lab., NM (United States) 1995.
Title = {TRANSIMS: Transportation analysis and simulation system},
Author = {Smith, Laron and Beckman, Richard and Baggerly, Keith},
Institution = {Los Alamos National Lab., NM (United States)},
Year = {1995},
Date-added = {2015-09-14 04:12:22 +0000},
Date-modified = {2015-09-14 04:12:22 +0000}
[29] MATSim development team (ed.), “MATSIM-T: Aims, approach and implementation,” {IVT, ETH Zürich, Zürich} 2007.
Title = {{MATSIM-T: Aims, approach and implementation}},
Author = {{MATSim development team (ed.)}},
Institution = {{IVT, ETH Z\"urich, Z\"urich}},
Year = {2007},
Date-added = {2015-06-29 21:06:46 +0000},
Date-modified = {2015-06-29 21:08:45 +0000}


CSE675 References

Problem Set 1 Solution

Solution to Problem 1

1.1 There are ${6 \choose 1}=1$ event with 0 outcomes (also called atomic event), ${6 \choose 1}=6$ events with 1 outcomes, ${6 \choose 2}=15$ events with 2 outcomes, …, ${6 \choose 6}=1$ event with all 6 outcomes. The total number of events are $\sum_{i=0}^{6}{6 \choose i}=2^{6}=64$ events. The events with 2 outcomes are for instance “the outcome is either 1 or 2”.

1.2 The events generated by $X$ are $\left\{ \emptyset,\{\omega:X(\omega)=0\},\{\omega:X(\omega)=1\},\Omega\right\} $.
In other words, they are $\left\{ \emptyset,\{1,2\},\{3,4,5,6\},\{1,2,3,4,5,6\}\right\} $.

The events generated by $Y$ are $\left\{ \emptyset,\{\omega:Y(\omega)=0\},\{\omega:Y(\omega)=1\},\Omega\right\} $. In other words, they are $\left\{ \emptyset,\{2,3,5\},\{1,4,6\},\{1,2,3,4,5,6\}\right\} $.

The events generated by $X$ and $Y$ jointly are the events “generated” by the following 4 disjoint events: $A=\{\omega:X(\omega)=0\&Y(\omega)=0\}=\{2\}$, $B=\{\omega:X(\omega)=0\&Y(\omega)=1\}=\{1\}$, $C=\{\omega:X(\omega)=1\&Y(\omega)=0\}=\{3,5\}$ and $D=\{\omega:X(\omega)=1\&Y(\omega)=1\}=\{4,6\}$. There are $2^{4}=16$ of them: $\emptyset$, $A$, $B$, $C$, $D$, $A\cup B$, $A\cup C$, $A\cup D$, $B\cup C$, $B\cup D$, $C\cup D$, $A\cup B\cup C$, $A\cup B\cup D$, $A\cup C\cup D$, $B\cup C\cup D$, $\Omega$.

Since every outcomes have a equal probability of $\frac{1}{6}$. The probabilities associated with the events can be computed in terms of the number of outcomes in the events. For example $P(A)=\frac{1}{6}$, $P(B)=\frac{1}{6}$, $P(C)=\frac{2}{6}$ and $P(D)=\frac{2}{6}$.

The mean of $X$ is $\mbox{E}X=0\cdot P(X=0)+1\cdot P(X=1)=\frac{4}{6}$. The variance of $X$ is $(0-\mbox{E}X)^{2}\cdot P(X=0)+(1-\mbox{E}X)^{2}\cdot P(X=1)=P(X=0)P(X=1)=\frac{8}{36}$. The mean of $Y$ is $\mbox{E}Y=0\cdot P(Y=0)+1\cdot P(Y=1)=\frac{3}{6}$. The variance of $Y$ is $\frac{9}{36}$.

The events defined by $XY$ are “generated” by $\{\omega:(XY)(\omega)=0\}$ and $\{\omega:(XY)(\omega)=1\}=\{\omega:X(\omega)=1\}\cap\{\omega:Y(\omega)=1\}=\{4,6\}$, with probability $P(\{\omega:(XY)(\omega)=1\})=P(\{4,6\})=\frac{2}{6}$. The mean of $XY$ is hence $\frac{2}{6}$ and the variance is hence $\frac{8}{36}$.

The random variables $X$ and $Y$ are dependent, because $P(X,Y)\ne P(X)P(Y)$, where $(X,Y)\in\{(0,0),(0,1),(1,0),(1,1)\}$.

1.3 Conditional expectations are random variables, and we can work out the events defined by the conditional expectations as we did previously. $\left(\mbox{E}(A|{\cal F}_{X})\right)(\omega)=\sum_{i\in\mbox{range}(X)}A(\omega)P(X(\omega)=i)$.

& & \begin{array}{c|cccccc}
\omega & 1 & 2 & 3 & 4 & 5 & 6\\
\hline I(\omega) & 1 & 2 & 3 & 4 & 5 & 6\\
\left(\mbox{E}(I|{\cal F}_{X})\right)(\omega) & \frac{2}{6} & \frac{2}{6} & \frac{4}{6} & \frac{4}{6} & \frac{4}{6} & \frac{4}{6}\\
\left(\mbox{E}(I|{\cal F}_{X})\right)(\omega) & \frac{3}{6} & \frac{3}{6} & \frac{3}{6} & \frac{3}{6} & \frac{3}{6} & \frac{3}{6}\\
\left(\mbox{E}(I|{\cal F}_{X,Y})\right)(\omega) & \frac{1}{6} & \frac{1}{6} & \frac{2}{6} & \frac{2}{6} & \frac{2}{6} & \frac{2}{6}

Solution to Problem 2

2.1 Let us count the number of success in a unit time, if we conduct Bernoulli trials every $\Delta$ time repeatedly and the probability of success is $\lambda\Delta$. It is a binomial distribution $B(\frac{1}{\Delta},\lambda\Delta)$. The probability of $k$ successes is $p(k;\frac{1}{\Delta},\lambda\Delta)={\frac{1}{\Delta} \choose k}\left(\lambda\Delta\right)^{k}\left(1-\lambda\Delta\right)^{\frac{1}{\Delta}-k}$. Taking the limit
\lim_{\Delta\to0}p(k;\frac{1}{\Delta},\lambda\Delta) & = & \lim_{\Delta\to0}{\frac{1}{\Delta} \choose k}\left(\lambda\Delta\right)^{k}\left(1-\lambda\Delta\right)^{\frac{1}{\Delta}-k}\\
& = & \lim_{\Delta\to0}\frac{\frac{1}{\Delta}(\frac{1}{\Delta}-1)\cdots(\frac{1}{\Delta}-k+1)\cdot\Delta^{k}}{k!}\lambda^{k}\exp(-\lambda)\\
& = & \frac{1}{k!}\lambda^{k}\exp(-\lambda).
We get a Poisson distribution with rate $\lambda$.

2.2 Let $X$ have exponential distribution with rate $\lambda$. In other words , $X$ has probability density function $f(x)=\lambda\exp(-\lambda x)$ and cumulative probability function $P(\{X\le x\})=1-\exp(-\lambda x)$. It follows that
P(\{X\ge s+t\}) & = & \exp(-\lambda(s+t))\\
& = & \exp(-\lambda s)\cdot\exp(-\lambda t)\\
& = & P(\{X\ge s\})\cdot P(\{X\ge t\}).
Let $X$ have geometric distribution, with cumulative probability function $P(\{X\le k\})=1-(1-p)^{k}$. It follows that
P(\{X\ge s+t\}) & = & (1-p)^{s+t}\\
& = & (1-p)^{s}(1-p)^{t}\\
& = & P(\{X\ge s\})\cdot P(\{X\ge t\}).

In fact, the memory property can also be understand in terms of how we sample the time to the next success — We do not refer to the history in the sampling process.

2.3 We first construct the Lagrange function $L=-\sum_{i=1}^{\infty}p_{i}\log p_{i}+\lambda_{0}(\sum_{i=1}^{\infty}p_{i}-1)+\lambda_{1}(\sum_{i=1}^{\infty}i\cdot p_{i}-\mu)$. The Lagrange function is continuous with the parameters $p_{i}$. Taking the partial derivative of the Lagrange function over the parameters, we get
& & \frac{\partial L}{\partial p_{i}}=-1-\log p_{i}+\lambda_{0}+i\lambda_{1}\stackrel{\mbox{set}}{=}0\\
& \Rightarrow & p_{i}=\exp\left(1-\lambda_{0}-i\lambda_{1}\right)=\exp(1-\lambda_{0})\exp(-\lambda_{1}\cdot i).\\
& & \sum_{i=1}^{\infty}p_{i}=\exp(1-\lambda_{0})\sum_{i=1}^{\infty}\exp(-\lambda_{1}\cdot i)\stackrel{\mbox{set}}{=}1\\
& \Rightarrow & \exp(1-\lambda_{0})\frac{\exp(-\lambda_{1})}{1-\exp(-\lambda_{1})}=1\Rightarrow\exp(1-\lambda_{0})=\exp(\lambda_{1})-1\\
& & \sum_{i=1}^{\infty}i\cdot p_{i}\stackrel{\mbox{set}}{=}\left(\exp(\lambda_{1})-1\right)\sum_{i=1}^{\infty}i\cdot\exp(-\lambda_{1}\cdot i)\\
& & =\left(\exp(\lambda_{1})-1\right)\frac{\partial}{\partial\lambda_{1}}\sum_{i=1}^{\infty}\exp(-\lambda_{1}\cdot i)\\
& & =\left(\exp(\lambda_{1})-1\right)\frac{\partial}{\partial\lambda_{1}}\frac{\exp(-\lambda_{1})}{1-\exp(-\lambda_{1})}=-\frac{1}{\exp(\lambda_{1})-1}\stackrel{\mbox{set}}{=}\mu.
Hence $p_{i}=(1-\frac{1}{\mu})^{i-1}\frac{1}{\mu}$.

2.4 We first express the cumulative probability function of a Gamma distribution in terms a Poisson distribution $P(X\le x)=1-\sum_{i=0}^{k-1}\frac{(\lambda x)^{k}}{k!}\exp(-\lambda x)$, then take the partial derivative over “time” to get the probability density function: $\frac{d}{dx}P(X\le x)=\frac{\lambda}{(k-1)!}(\lambda x)^{k-1}\exp(-\lambda x)$

Solution to Problem 3

3.1 The moment generating function of a random variable $X$ is $\mbox{M}(t)=\mbox{E}\left(\exp(tX)\right)=\int dx\cdot p(x)\exp(t\cdot x)$. We prove the moment generating property by induction. Its $0^{\mbox{th}}$ order derivative at $t=0$ (i.e., its value at 0) is $M(0)=\int dx\cdot p(x)\exp(0\cdot x)=1$. Its $1^{\mbox{st}}$ order derivative is
\frac{\partial}{\partial t}\mbox{M}(t) & = & \frac{\partial}{\partial t}\int dx\cdot p(x)\exp(t\cdot x)\\
& = & \int dx\cdot p(x)\frac{\partial}{\partial t}\exp(t\cdot x)\mbox{ , with mild regularity condition}\\
& = & \int dx\cdot p(x)\cdot x\exp(t\cdot x).
Hence $\frac{\partial}{\partial t}\mbox{M}(t)|_{t=0}=\int dx\cdot p(x)\cdot x\exp(0\cdot x)=\int dx\cdot p(x)\cdot x=\mbox{E}(x)$.
Now suppose $\frac{\partial^{n}}{\partial t^{n}}\mbox{M}(t)=\int dx\cdot p(x)\cdot x^{n}\exp(t\cdot x)$,
it follows that $\frac{\partial^{n}}{\partial t^{n}}\mbox{M}(t)|_{t=0}=\mbox{E}(X^{n})$,
\frac{\partial^{n+1}}{\partial t^{n+1}}\mbox{M}(t) & = & \frac{\partial}{\partial t}\left(\frac{\partial^{n}}{\partial t^{n}}\mbox{M}(t)\right)\\
& = & \frac{\partial}{\partial t}\int dx\cdot p(x)\cdot x^{n}\exp(t\cdot x)\\
& = & \int dx\cdot p(x)\cdot x^{n}\frac{\partial}{\partial t}\exp(t\cdot x)\mbox{, with mild regularity condition}\\
& = & \int dx\cdot p(x)\cdot x^{n+1}\exp(t\cdot x).
Hence $\frac{\partial^{n+1}}{\partial t^{n+1}}\mbox{M}(t)|_{t=0}=\mbox{E}(X^{n+1})$.

Given random variable $X$ and its moment generating function $\mbox{M}_{X}(t)$, the linear transform of the random variable is $\alpha X+\beta$, where $\alpha$ and $\beta$ are real numbers (or vectors). The moment generating function by definition is
\mbox{M}_{\alpha X+\beta}(t) & = & \mbox{E}\left(\exp\left(t\cdot(\alpha x+\beta)\right)\right)\\
& = & \mbox{E}\left(\exp(\alpha t\cdot x)\exp(t\cdot\beta)\right)\\
& = & \mbox{E}\left(\exp(\alpha t\cdot x)\right)\cdot\exp(t\cdot\beta)\mbox{, linearity of expectation}\\
& = & \mbox{M}_{X}\left(\alpha t\right)\cdot\exp(t\cdot\beta).

Given indenpendent random variables $X_{1},\cdots,X_{n}$ with moment generating functions $\mbox{M}_{1}(t),\cdots,\mbox{M}_{n}(t)$. The moment generating function of the sum is
\mbox{M}_{\sum_{i=1}^{n}X_{i}}(t) & = & \mbox{E}\left(\exp(t\cdot\sum_{i=1}^{n}X_{i})\right)\\
& = & \mbox{E}\left(\prod_{i=1}^{n}\exp(t\cdot X_{i})\right)\\
& = & \prod_{i=1}^{n}\mbox{E}\left(\exp(t\cdot X_{i})\right)\mbox{, functions of independent r.v.s are independent}\\
& & \prod_{i=1}^{n}\mbox{M}_{X_{i}}\left(t\right).
3.2 The moment generating function of linear transform is $\mbox{M}_{\frac{X_{i}-\mu}{\sigma}}(t)=\mbox{M}_{X_{i}}(\frac{t}{\sigma})\cdot\exp(-\frac{\mu}{\sigma}\cdot t)$.
The moment generating function of the sum is $\mbox{M}_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}}(t)=\exp\left(-\frac{\sqrt{n}\mu}{\sigma}\cdot t\right)\prod_{i=1}^{n}M_{X_{i}}(\frac{t}{\sqrt{n}\sigma})$.

\mbox{M}_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}}(t) & = & \prod_{i=1}^{n}M_{\frac{X_{i}-\mu}{\sigma}}(\frac{t}{\sqrt{n}})\mbox{, m.g.f. of linear transformation}\\
& = & \prod_{i=1}^{n}\left(1+\frac{t}{\sqrt{n}}\cdot0+\frac{t^{2}}{2!\cdot n}1+\frac{1}{\sqrt{n}}\xi\right)\mbox{, Taylor expansion}\\
& = & \left(1+\frac{t^{2}}{2n}+\frac{1}{\sqrt{n}}\xi\right)^{n},\\
\lim_{n\to\infty}\mbox{M}_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}}(t) & = & \frac{t^{2}}{2}\mbox{, L’Hospital’s Rule.}

Because the moment generating functions of the series of random variables converge to that of the standard normal variable, and each moment generating function uniquely , it follows that the series of random variables converges to one with standard normal distribution.

Problem Set 1 Solution

Concepts from Probability Theory [09/03/2015]

In this lecture we will review some concepts from probability theory. Since we are using probability theory, and in particular we’re using the concept of stochastic process to understand our complex social systems. I think it is better that we can reach some common ground about the concepts of probability theory. Today we’ll begin to involve some math so please bare with me about that. The most important thing is that we use math to do different things, to do important things. My understanding is that there are different people doing different things. There are mathematicians, and there are physicists, and there are engineers.

All people do math but they do math in different ways. If we think about how mathematicians do math, they treat math as a string of symbols. What they do is, they set up some actions and they set up some rules of derivation then they foil the rules then they get pages after pages of symbols and that’s it. That’s how mathematicians do math. They don’t care about the interpretation because once they interpret this then it’s metamath, it’s not longer math. You may ask why they want to do that, just play with the symbols just like artists play with their brush. The reason is that by doing this, they will no longer be constrained by our real world so they’re in a virtual world. Although it’s not very easy, they gain new freedom in their thought experiment in doing new things because in the real world, they are constrained by forces and by physical laws. But in a virtual world defined by the actions and by the laws, they no longer have such constraints so they gain new freedom although it is not easy.

There are also physicists. Physicists also work with math but physicists work with math in a different way. A physicist is always very interested about the intuition about the math laws and the theorems. Whenever they have something, they ask, “What is the intuition and can we visualize that?” What is the force and what is acceleration? The emphasis of physicists is they want to find the intuition, they want to use math to assist them in their understanding of physical laws. This is their way of working with math.

Engineers work still in a very different way. Engineers tend to solve problems with math. This is what we do. One thing that I can say is that whenever we work with math, it’s always that we just build things. I think the most important thing is that we understand how we build those conclusions with math and I will show you the way of building the conclusions. First, it’s nothing that is not understandable. I think that we’re more interested in using math to solve our problems.

Sample Space, Events and Field of Events

We have heard a lot about probability as random events. Let’s first have some common ground on what is sample space and event, what is a field of events, what is expectation and conditional expectation.

Let’s throw a dice twice. What is the sample space? A sample space is the space of all outcomes ($\Omega$). From the sample space, we might outcome $(1,1)$, meaning we get a one from the first throw and get another one from the first throw, and so on and so forth. The size of the sample space is $6\times 6=36$.

If we continue to throw a dice then we get more and more information about the sequence.

  • Before we throw the dice, the only events that we can talk about is the set of all possible outcomes or its complement the empty set ($(\emptyset, \Omega)$).
  • After we throw the dice once, we get some information — the information about the first throw: We could get 1, 2, 3, 4, 5 or 6, which could be an even or odd number, or a prime or composite number. Those are all examples of events about the first throw. How large will it be the set of events from the first throw? — It is $2^6$: The event could include 1 or exclude 1, and it could include 2 or exclude 2, and so on. Different inclusions and exclusions give different events. If we consider whether or not this event include 1 then we have two choices. Then if we consider whether the event include 2, we have another two choices. In total, we will have $2^6$ different choices. Those are all different events. The set of events after the first throw includes the set before the first throw. (Verify it!)
  • When we throw this dice a second time, we can talk about the outcome of not only the first throw but also the second throw. The set of events after the first throw is a subset of the events of the second throw, which is a subset of the events in the third throw, and so on. (Verify it!) This is how, as we conduct more and more experiments, we’ll gain more and more information about this system and we will have larger and larger sets of events.

The sample space is $\Omega$. An event is any subset of $\Omega$. An atomic event is a event that cannot be subdivided, a compound event is a union of atomic events, the complement of an event is  the case that the event doesn’t happen. For example, after the first throw, the atomic events are all the numbers we could read from the dice: $\{1,2,3,4,5,6\}$, a even number is a compound event and its complement is an odd number. The set of all events, which is a set of sets, form a field. We call something a field if the following conditions are satisfied.

  • The empty sets should belong to the field.
  • The sample space $\Omega$ should belong to the field.
  • If we have two events $A$ and $B$, the union of the two events (either A or B), the intersection of the events (both A and B), and the subtraction (A but not B) are also events. For example, one event could be that the outcome is an odd number. Another event could be that the outcome is a prime number, then we can talk about odd prime number, odd or prime number, and a number that is odd and is not prime.

Random Variables

Up to now, we have defined a sample space, atomic events, compound events, and field of events. We will proceed to define random variables or functions defined on a field of events.

In the experiment of throwing the dice twice, we can define a function from our sample space which  $\{1,\cdots,6\}\times \{1,\cdots,6\}$, it’s a tuple. We can define a function on this sample space, meaning that we can assign numerical values to different outcomes. An outcome could be $(1,1)$, for example and then we set a function that is the sum of the throws $1+1=2$. Or we could define a characteristic function which equals 1 if the number yielded by the second throw is 3 times the number yielded by the first throw, and 0 otherwise. There are many different ways to define a function on a sample space of finite number of outcomes.

A random variable defines a field. Let us inspect a random variable $X: \{1,2,3,4,5,6\}\to\{0,1\}$ which is the characteristic function for an outcome to be an even number $X(\omega)=(\omega+1) \mod 2$. So, for $\omega$ taking 1,2,3,4,5,6 an $X(\omega)$ takes 0,1,0,1,0,1. What is the field defined by $X$? According to our definition of a field, we have to add in empty set $\emptyset$. We have to add in the sample space $\{1,2,3,4,5,6\}$. We have to add in event $\{X=0\}=\{\omega: X(\omega)=0\}$. The outcomes 1,3 and 5 belong to this set. We have to add in event $\{X=1\}$, which contains the outcomes 2,4, and 6. We can verify that this is the field generated by X: The empty set and the world set $\Omega$ belong to the set of sets. For any event $A$ and $B$, $A \cup B$, $A\cap B$ and $A\\  B$ also belong to a set.

In addition, there are many ways to define a field on the sample space, a random variable may be on field set or maybe not on this field. For example, we have a field of events defined by $X(\omega)$ — that $\omega$ is a odd number or a even number. Is random variable $Y(\omega)=\omega$ defined on the field generated by $X$? It is not because the atomic events ($\{1,2,3,4,5,6\}$) are not events generated by $X$. The field generated by $Y$ includes and is larger than the field generated by $X$. If we look at the field subset relationship in the other direction, we find that the events generated by X, which is $\{X=0\}$, $\{X=1\}$, the whole set $\Omega$, and the empty set $\emptyset$, all belong to the events generated by Y. In this incidence we say that random variable $X$ is $Y$ measurable or is the random variable defined on Y.

For another example, we can define a random variable as the sum of two throws. This sum defines a field generated by events $\{X_1+X_2+=2\},\cdots,\{X_1+X_2+=12\}$. (Verify this!)

It is a little complicated to define a random variable as a function defined on a field rather than to directly talk about atomic events and assign a probability measure. It is actually necessary if we want to deal with complicated randomness in reality. For example  the world economics as our sample space is too complex to be modeled. A random variable X such as “whether the stock price of Apple will rise or fall tomorrow” could be much simpler and manageable. X is a map from the very complicated world into a much simpler one, and X does not capture all randomness about the world economics. To see that X doesn’t capture all randomness in the world of economics, we can define a random variable Y, which is “whether the funding in my retirement account is going to rise or fall tomorrow”, and which is likely independent of random variable $X$ and not defined on the field generated by $X$. In contrast if we confine ourselves to only work with the values of X as our sample space, we cannot describe something as complex as world economics. In addition since X captures all randomness that we know of, we cannot defined a random variable $Y$ from the sample space defined by the value of $X$ that is independent of $X$. That is the reason why we bother to first define omega which is sample space and then define a random variable X which maps this sample space into a simpler space. We can start from some very complex world and we chip something out to study which is simple and manageable.

Probability, Expectation & Conditional Expectation

A probability measure is positive numbers assigned to the atomic events in a sample space. The sum of the probabilities assigned to the atomic events should sum up to 1 because we do not want to leave out something.

Let’s assign equal probability the 36 possible outcomes from two throws of a dice. The probability for each outcome is 1/36. Now that we have assigned probability to all of the atomic events, we can talk about the probability assigned to compound events, which is the sum of the probabilities assigned to its elements. For example, what is the probability that the sum of the two throws is 2? It is 1/36 because the only possible atomic event to generate a 2 is $(1,1)$. Similarly the probability for the sum of the two throws to be 3 is 2/36, and the probability for the sum to be 4 is 3/36.

After we have assigned probability to the sample space, we can estimate the expectation and variance of a random variable. (A random variable is a function that assigns a value to an outcome of a experiment.) What is the probability mass function of $X$, that is the sum of two throws, supposing that every one of the 36 outcomes are equally likely? The outcome of the sum of the two throws is from 2 to 12 (where 12 corresponds to $(6, 6)$), with each outcome $x$ assigned a probability that is the number of occurances of $x$ in the following table divided by 36. The sum of the probabilities sum up to 1.

+ & 1 & 2 & 3 & 4 & 5 & 6\\
\hline 1 & 2 & 3 & 4 & 5 & 6 & 7\\
2 & 3 & 4 & 5 & 6 & 7 & 8\\
3 & 4 & 5 & 6 & 7 & 8 & 9\\
4 & 5 & 6 & 7 & 8 & 9 & 10\\
5 & 6 & 7 & 8 & 9 & 10 & 11\\
6 & 7 & 8 & 9 & 10 & 11 & 12
What is the expectation of X? The expectation of this random variable is the weighted sum over all possible outcomes of X, where the weight is the probability ($E(X)==\sum_\omega X(\omega)P(\omega)=\sum_x x\cdot p(x)$). If we sum these together, we get 7. What is the variance of X? The variance of X is the weighted square of the difference between X to the average value of X, where the weight is the probability($\mbox{Var}(X)=\sum_x (x-E(X))^2\cdot p(x)$). We can show that $\mbox{Var}(X)=E(X^2)-(E(X))^2$.

Similarly, if we define a random variable $g\circ X$ by function composition, where random variable X is already a function. For example, if X is identity function $X(\omega)=\omega$, $g\circ X(\omega)=2\times\omega$ for $g(X)=2\times X$, and$g\circ X(\omega)=\omega^2$ for $g(X)=X^2$. We can estimate the expectation of function composition $E(g\circ X) = \sum_\omega g(X(\omega))p(\omega)$, where $\omega$ takes atomic events in the sample space. There is a simpler way to estimate expectation: $g\circ X=\sum_x g(x) p(x)$, because the field generated by $X$ is a subfield of the field generated by all $\omega\in \Omega$.

Now we talk about conditional expectationconditional probability and the conditional independence. Recall that we assigned probability 1/36 to each of those 36 events. From this probability assignment it follows that the probability for the first throw and the probability for the second throw are independent. Here is the reason. We know that $P((m,n))={1\over 36}$, $P((m,\bullet))={1\over 6}$ and $P((\bullet,n))={1\over 6}$ by our setup. Hence $P((m,n))=P((m,\bullet))\times P((\bullet,n))$. This means that given the result of the first throw does not help us much to make inference about what will be the result of the second throw. That’s the first throw and the second throw as independent based on our assignment of 1/36 to the atomic events.

For another example, conditioned on that is the sum of the two throws is 3, the probability of the first throw to yield 1 is 1/2 and the probability of the first throw to yield 2 is also 1/2. The reason is that we only have two cases that we can yield a sum of 3, which is $(1,2)$ and $(2,1)$.

Below are several important properties of expectation and variance:

  • Expectation of linear combination Let $X_i$ be random variables, then $\bf E(a_0+a_1 X_1+\dots + a_n X_n) = a_0 + a_1\bf E(X_1) + \dots + a_n \bf E(X_n)$. (Why?)
  • Expectation of independent product Let $X$ and $Y$ be independent random variables, then $\bf E(X Y) = \bf E(x)\bf E(Y)$. (Why?)
  • Law of total expectation Let $X$ be integrable random variable (i.e., $\bf E(|X|)<\infty$) and $Y$ be random variable, then $\bf E(X) = \bf E_Y(\bf E_{X|Y}(X|Y))$. (Why?)
  • Variance of linear combination Let $X_i$ be random variables, then $\bf{Var}(a_0+a_1 X_1+\dots + a_n X_n) = a_0^2 + a_1^2\bf{Var}(X_1) + \dots + a_n^2 \bf{Var}(X_n)$
  • $\bf{Var}(X) = \bf E(X^2) – {(\bf E(x))}^2$

A stochastic process evolves and we get more observations about it over time. The future observations are conditioned on the current and the past observations. That is how this conditional probability and conditional expectation emerge and why it is important in studying a stochastic process. Let us work with conditional expectation in a finite sample space $\Omega$. The field of events is the power set $2^\Omega$ or the set of all subsets of $\Omega$. We are given a field $\mathcal G$ generated by a partition $(D_1,\cdots,D_k)$ ($\cup_i D_i=\Omega$, $D_i\cap D_j=\emptyset$ for $i\ne j$). We are given a random variable X which assigns the atomic events to the atoms to a number $x_1,\cdots, x_p$. The conditional probability of a event A given event D is the probability of A and D divided by the probability of D, or the probability for A and D to both happen conditioned on that probability D happened. For example the conditional probability for the sum of two throws to be even ($X_1+X_2 \mod 2=0$) conditioned on that the first throw is a prime number ($X_1$ is prime). The conditional probability of event A given a field $\mathcal G$ is the probability of an outcome $\omega$ to be in $A$ conditioned on that we know which $D_i$ this $\omega$ belongs to: $P(A|\mathcal G)(\omega)=\sum_i P(A|D_i)I_{D_i}(\omega), where I is an indicator function and $d_1$ to $d_k$ make a partition of the whole sample space. So conditional probability is a random variable. For example the conditional probability for the sum of two throws to be a even number conditioned on that we know whether the first throw is a prime number or a composite number — The conditioned random variable (or field) represents things we know and the conditioning variable represents things we do not know.  Similarly we can also define conditional expectation of a random variable given a field $E(X|\mathcal G)=\sum_i x_i P(X=p_i|\mathcal G)$, and a conditional expectation is also a random variable.

Continuous Probability Space

We talked about how people work on a space with finite number of values. Now we move from finite to infinite. The way that we do this is actually very simple: We approximate the infinite in terms of finite, and take the limit. That is, we aim to approximate the probability of a measurable set in a continuous space better and better when the number of “tiles” to cover this measurable set goes to infinite. Similarly, we approximate a measurable functions with a finite sum of indicator functions of measurable sets and aim to improve our approximation of the continuous function as we sum over more and more indicator functions. A random variable is just a measurable function that maps a probability space into a measurable space.

When we move from a sample space with finite outcomes to one with continuous outcomes, we need to work with a sigma field. A sigma field is a field with the capability to move us from finite to infinite.In a $\sigma$-field, disjoints measurable sets are countably additive, meaning that we can add countably number of sets.

Let us sample a standard normal distribution using $X=\sum_{i=1}^{12}(U_i-{1\over 2})$, where $U_i$ takes uniform distribution on $[0,1]$. We can show that $X$ has mean 0, standard deviation 1, and probability density function very close to standard normal distribution.

Our sample space here is 12 dimensional $\prod_{i=1}^{12}[0,1]$. $X$ maps the 12 random variables of uniform distributions to the real line. We can define probability density function and cumulative probability function for real-value random variable $X$.

Bayesian Theorem

Bayesian theorem enables us to turn the conditional around. We have worked with Bayesian theorem in the last lecture about predicting whether from the daily activity of my girlfriend. Here we have another one. Suppose we have a very rare disease and this disease is pretty serious so we want to screen this disease in advance. Let $p$ represents positive clinical test and $d$ represent having disease. Let the probability of positive clinical test conditioned on disease be $P(p|d)=.98$ and let the probability of positive clinical test conditioned on no disease be $P(p|\Omega\\d)=.02$, and let the prior probability of disease be $P(d)=.0001$. If I’m diagnosed to have some positive clinical test result, what is the probability that I have a disease? We will use Bayesian theorem to turn the conditional around, from $P(p|d)$ to $P(d|p)$. It turns out that if I have positive clinic test, the probability for me to have this disease is still pretty small, .1%, which is still not much to worry about. On the other hand, since the prior probability is .001, a positive clinic test increase the probability of having disease by 10-fold.

What is the relationship between mutual independence and pairwise independence? Mutual independence implies pairwise independence but not the other way. In the following example, we have 2 out of 4 faces in red, green and blue, hence $P(r)=P(g)=P(b)={1\over 2}$. We have 1 out of 4 faces in red and green, green and blue, and red and blue, hence $P(rg)=P(gb)=P(br)={1\over 4}$ so we have pairwise independence. We have 1 out of 4 faces in red, green and blue, hence $P(rgb)={1\over 4}$ But since $P(rgb)={1\over 4}\ne P(r)P(g)P(b)={1\over 8}$, we do not have mutual independence.


Concepts from Probability Theory [09/03/2015]

Introduction [09/01/2015]

What is Simulation Modeling and Why do We Need it?

In this course, we’re going to talk about simulation modeling. What is simulation modeling? We have a world. We do a lot of things to make this world a better world: We build roads, we build cities, and we fight with diseases, and so on. Those are pretty expensive human endeavors because those activities may cause many man-months associated with millions of dollars, and they may have substantial consequences.

How can we make such activities efficient and error-free as much as we can? We do simulation modeling: First we find out the basic patterns of this real world. Then we construct a virtual world that duplicates the real world following the same set of rules. Then we simulate this virtual world, and we conduct thought experiments and reason about how we can improve this virtual world. After we finally find a way to improve this virtual world out of many trials and errors, we can proceed to implement this way in our real world. Why do we think this strategy to improve the virtual world will also improve the real world? The reason is that we have set up the virtual world to run under the same set of rules as the world world. This 1-to-1 map between the virtual world and the real world is our guarantee that whatever runs in the virtual world will be followed by the real world and vice versa. This type of reasoning doesn’t always work — We’re human beings and we make mistakes. We reflect on our past, plan our future, and conduct thought experiments at all times.

Simulation modeling has been widely applied by researchers from different disciplines to assist their scientific reason. We computer scientists need to understand the applications of simulation modeling to design algorithms and help researchers from other disciplines in their thought experiments. In systems biology, it finds its application to model the stochasticity of the chemical reactions happening at the cell level. According to Wilkinson, “modeling is attempt to describe in a precise way an understanding of the elements of a system of interests, their states, and their interactions with other elements” and it “should be sufficiently detailed”. Joshua Epstein — social scientist working on computational epidemics other dynamical systems would conduct research on generative social science by “situating an initial population of autonomous heterogeneous agents in a relevant spacial environment, allow them to interact according to simple local rules, and thereby generate or grow the microscopic regularity from the bottom up”. His approach is derived from from the approach in solid matter and statistic physicists. The idea is that we can make macroscopic observations about our system, such as temperature, or pressure and volume. We want to understand what’s going on there and how we can have pressure, and temperature, and volume. Our approach is simulate modeling: We try to construct the microscopic measure of our system from the microscopic interactions of the atoms. If we can construct the macroscopic phenomena from microscopic interactions of the atoms, we have a hypothesis about the dynamics of this system.

Why do we need models? According to Wilson, we need models because we want to check out our understanding of a particular system, conduct thought experiments and use those thought experiments to inform our design and analysis. According to Epstein, with models “we can check with a hypothesized micro-specification suffice to generate the observed phenomenon”. According to Page in his online course Model Thinking, models can help us involve conversations, weed out the logic inconsistencies, engage and understand the data, and decide, strategize and design our ways to work with the world.

How does simulation modeling work and how computer scientists can make better simulation-modeling tools? Let us conduct some thought experiments.

Simplified Weather System

The first model is a simplified weather system. In this weather system, we have sunny days and rainy days. If today is a sunny day, tomorrow is more likely to be a sunny day because the weather tends to stay in a state for a while. Similarly, if today is a rainy day, tomorrow is more likely to be a rainy day. Sunny days tend to last longer, and the rainy days tend to last shorter. Let us for the sake of reasoning specify that if today is a sunny day ($X_t=\mbox{sunny}$), then with 90% probability that tomorrow is going to be a sunny day ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})=0.9$) and with 10% probability that tomorrow’s going to be a rainy day ($P(X_{t+1}=\mbox{rainy}|X_t=\mbox{sunny})=0.1$); If today’s a rainy day ($X_t=\mbox{rainy}$), then with 80% probability tomorrow is going to be a rainy day ($P(X_{t+1}=\mbox{rainy}|X_t=\mbox{rainy})=0.8$), and with 20% probability tomorrow is going to be a sunny day ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{rainy})=0.2$). We have from sunny to rainy, 10%, rainy to sunny 20%.

Given this simple model and given that today is a sunny/rainy day, we can simulate/sample this weather system: Given that today’s a sunny day, we can throw a biased coin that yields a head with probability 0.9 and yields a tail with probability 0.1, and we note a sunny day for tomorrow’s weather if we get a head and a rainy day if we get a tail. Given that today’s a rainy day, we can throw another biased coin that yields a head with probability 0.2 and yields a tail with probability 0.8, and we note a sunny day for tomorrow’s weather if we get a head and a rainy day if we get a tail. In this way we can get one sample path “sunny, sunny, sunny”, and another sample path “sunny, sunny, rainy”. Corresponding to each sample path, we have a probability: If today’s a sunny day, then the probability for tomorrow’s a sunny day is .9, and the probability for condition tomorrow’s a sunny day, the probability for the day after tomorrow is a sunny day is also probability of 9. A sequence “sunny, sunny, sunny” is associated with a probability $P(X_{t+1}=\mbox{sunny},X_{t+2}=\mbox{sunny}|X_{t}=\mbox{sunny})=.9\times .9=.81$. Similarly, a sequence “sunny, sunny, rainy” is associated with a probability which is $P(X_{t+1}=\mbox{sunny},X_{t+2}=\mbox{rainy}|X_{t}=\mbox{sunny})=.9\times .1=.09$.

So we have defined a a way to sample system, and we have associated some randomness to different sample paths of this system. In this sense we have defined a generative model which we can use to generate different sample paths with different probabilities.

Now suppose that I at Buffalo chat with my girlfriend at San Francisco everyday. While she doesn’t tell me the daily weathers in San Francisco, she tells me whether she focuses on cleaning her apartment or she hikes around every day. On a sunny day, she is more likely to hike around and on a rainy day he is more likely to clean her apartment. Let us make some inferences about San Francisco’s weather from my girlfriend’s daily activities.

We will use forward-backward algorithm, which comes from Bayesian theorem. Suppose that today is a sunny day in San Francisco, and tomorrow my girlfriend tells me that she’s cleaning her apartment all day. What is going to be the probability for tomorrow to be a sunny day, and what is the probability for tomorrow to be a rainy day? Let $X_t$ represent the latent weather in San Francisco (which I do not know at day $t$) and Let $Y_t$ represent the observation of my girlfriend’s activity at day $t$.

Here’s our reasoning: Conditioned on that today is sunny, the probability for tomorrow as a sunny day is 90% ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})=0.9$) and the probability for tomorrow to be a rainy day is 10%. Conditioned on that tomorrow is a sunny day, my girlfriend cleans her apartment with 10% probability ($P(Y_{t+1}=\mbox{cleaning spartment}|X_{t+1}=\mbox{sunny})=0.1$), and walks outside with 90% probability($P(Y_{t+1}=\mbox{walking}|X_{t+1}=\mbox{sunny})=0.9$). Conditioned on that tomorrow is a rainy day, my girlfriend cleans her apartment with 90% probability and walks outside with 10% probability. Since my girlfriend tells me that she cleans her apartment tomorrow, my girlfriend could clean her apartment on a sunny day, and she could also clean her apartment on a rainy day. The probability for my girlfriend to clean her apartment on a sunny day tomorrow conditioned on that today is a sunny day is $P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})\times P(Y_{t+1}=\mbox{cleaning spartment}|X_{t+1}=\mbox{sunny}) =.9 \times .1 = .09$, and the probability for my girlfriend to clean her apartment on a rainy day conditioned on that today is a sunny day is $P(X_{t+1}=\mbox{rainy}|X_t=\mbox{sunny})\times P(Y_{t+1}=\mbox{cleaning spartment}|X_{t+1}=\mbox{rainy}) = .1\times .9 = .09$. The total probability for my girlfriend to clean her apartment is $.09+.09=.18$ , and conditioned on that tomorrow my girlfriend cleans her apartment, the probability for tomorrow to be a sunny day is ${.09\over .09+.09}=50\%$ , and the probability for tomorrow to be a rainy day is also $50\%$. This is the kind of reasoning that we engage to work with a stochastic processes.

Now suppose my girlfriend and I have chatted about her daily activities for many, many days. We have a sequence of activities such as “walking, walking and cleaning”. Instead of asking the probability for a specific day to be a sunny day or for a specific day to be a rainy day given daily activities, we ask, what is the most likely sequence of weathers corresponding to the sequence of activities.

The challenge to solve this inquiry is that the size of our search space increases exponentially with the number of days in the sequence: Conditioned on today’s weather, tomorrow’s weather is one of two states, and the weathers of tomorrow and the day after tomorrow will take 1 of  the $2 \times 2=4$ states, and the weathers of the next three days will take one of the $2 \times 2 \times 2=8$ states, and so on. In order to efficiently find out the most likely path, we will use a technique known as dynamic programming. If we know the most likely weather sequence from day 1 to day $t$ ending at state $X_t$ (sunny or rainy) at day $t$, the most likely sequence from day 1 to day $t+1$ ending at state $X_{t+1}$ is the one with the largest probability that passes through $X_t$ and $X_{t+1}$. In this way, we can efficiently compress the space for our search, and thus makes a search in an exponential state space into one with polynomial state space. That is the idea of the Viterbi algorithm.

Now suppose we have estimated the probability distributions of daily weathers at San Francisco from my girlfriend’s daily activities, or we have estimated the most likely sequence. Can we estimate the parameters of our stochastic process model? — What is the likelihood for tomorrow to be a sunny day or a rainy day conditioned on that today is a sunny day or rainy day? Asked alternatively what is the average duration of a stretch of sunny days before the weather changes, or what is the average duration of a stretch of rainy days before the weather changes? What is the event rate of a weather change from a sunny day to a rainy day, or from a rainy day to a sunny day? To answer this question, we inspect $n$ stretches of sunny days that is $N$ days in total. Because the probability for weather change from sunny day to rainy day is 0.1, we expect $n\over N=0.1$ which means the average duration of a stretch of sunny days is $N\over n={1\over .1}=10$ days. Similarly, we estimate the average stretch of rainy days is ${1\over .2}=5$  days. The duration for a stretch of sunny days is longer because the probability for a weather change from sunny to rainy is smaller. What we’re talking about is the relationship of different random variable, a random variable with exponential distribution, a random variable with geometric distribution, and a random variable with Poisson distribution, and a random variable with binomial distribution. We’ll talk about those concepts later.

Predator-Prey Dynamics

The simplified weather system has very simple dynamics, and the reality is much more complicated. The predator-prey model, although still pretty simple, has found applications in ecology and many other areas. In this model we have two populations: the predator population and the prey population. This system evolves according to three events or “chemical reactions”: First, the prey individual in the system can multiply into two prey individuals. The rate (probability per unit time) for this event to happen on a specific prey individual is $c_1$. (The rate for the whole population is $c_1\times$ prey population.) Second, whenever a prey individual and a predator individual meet each other, the predator will turn this prey into a predator. This event will remove a prey individual from this system and introduce a predator into this system. The rate for this event is $c_2$. Third the predator individual will die. This event will remove a predator individual from the system. The rate for this event is $c_3$.

This type of system exists in reality, and people often observe that the prey population and predator population change periodically, and that the predator population follows the prey population. We will ask, why we have the periodical movement of the prey population and the predator population, how do we preserve the ecosystem, when the ecosystem is near collapsing and with what probability, and so on and so forth. With the dynamics of this system, we can proceed to reason about different ways to introduce intervention and change this system.

We defined a generative model of a dynamics system driven by three events — We defined a way to sample this system, just as we defined a way to sample the simplified weather system: At time zero, we have our initial state, which is comprised of the prey population and the predator population. Then we can sample the next event and change the the predator and prey populations according to the event. Then we sample the next event again, and change the state of the system again, and so on and so forth. Given a sequence of events, we will know the sequence of states of this system in terms of number of preys and the number of predators. Associated with each sequence of events, we also have a probability, which is the probability for this sequence of events to happen. In this sense, this predator prey model is pretty similar to the weather system that we have and the mathematical model. This is a model developed by a chemist.

We can define these dynamics of the system in term of Gillespie algorithm that generates a description, a probability measure of a stochastic process. We can proceed to sample this system. Then we often need to infer the evolution of the system from imperfect observations about this system. Then we can proceed to reason about this system and different interventions and engage data with simulation modeling. Our observation about the predator-prey system is imperfect because we cannot continuously track populations and often we can only track one of the two populations. Our reasoning about a predator-prey system could include: Whether this system is in danger of collapse, Whether it’s going to be a peak of population, whether it’s going to be a valley of population, or if we want to change the dynamics of the system in this or that way what we want to do.

Bales Interacting Process Analysis

Many times we cope with the data about people, and the dynamics about people are much more complicated. Bales interaction processes analysis model is a model that talks about how groups of people solve problems face to face. Let us consider Computer Supported Collaborative Ware (CSCW in short). In CSCW research, we use sensors to capture images, and audio signal, and motion signal, and other kinds of signal. From these signals, we are supposed to advise the group how this group can improve the performance of problem solving. In other words, we listen to a group of people speaking unknown language — we look at how they interact because we do not know what they talk about, and in the end we want to give the people some other advice about we can solve problems better. This is a tough problem. In order to solve this problem we will use simulation modeling to help us engage gigabytes of data collected from group discussion and generate reasoning.

Bales Interaction Process is one such model: For a group to solve problems, this group needs to involve different types of tasks —tasks such as asking for facts and giving facts, asking for new directions to explore and giving new directions, voting for opinions, and solving disputes. Different tasks will involve different features: If I give a fact, I may talk longer and the other people may be listening; If a group of person are vote for a direction, there will be a lot of overlapping in their speech signals. If there’s a lot of quarreling,  the pitch and the volume may be higher, and so on.

Hence although we listen some language that we don’t understand, with this model we can understand the dynamics, such as whether a person is giving a fact or whether there is agreement, disagreement going on, and whether a group of person is voting, and who and who are on one side, and who and who are on another side. Then in the end of discussion, although we still do not know the content of discussion, we can count, for example, how many facts have been given in this discussion, and how many voting processes are going on, and whether there are a lot of quarrellings or not, and who is on one side and who’s on other side, and so on and so forth. In the end, we can tabulate different activities in the discussion, and then this effectively shrink the search space from one gigabyte of data to perhaps one page of numbers, and being able to estimate the performance of this group through this one page of numbers will definitely be more reliable.

In this example, the weather system example and the predator prey model, we start from defining a probabilistic model driven by a set of events, proceed to infer the evolution of the system our observations about this system, finally reason about the statistics about this system and reason about introducing intervention. The inference problem is normally much harder, because we involve complex interactions, and we will talk about different ways to solve this problem.

Road Transportation Dynamics

Let us talk about simulation modeling of transportation engineering. Building a road or setting up tolls are not easy, because if things are implemented in the wrong way, there will be a lot of turmoil going on in a city. Also, building a road involves billions of dollars. So we have to first carefully reason about our plan through simulation modeling before we can really implement it.

Transportation engineers start simulation modeling from collecting data. They collect the road network, which is similar to the road network that you use in a GPS navigator. They also collect the population density and typical travel patterns, which are from the census data and travel surveys. In the census data we have different census tracts, and in each tract we have number of households, and the average incomes of the households, and population distribution, and so on. The transportation department calls different people and ask then different questions about daily travel such as: “What is the distance from where you live to where you work?”; “How many trips to you have, and what are the types of trips?”. From this kind of data, we can synthesize the typical trips of different people. Now we have the travel network, which supplies the transportation, and we have the travel demand, which is synthesized from the population surveys.

After we have those two different types of information, we proceed to put this data into a simulator. The simulator will simulate how people move out in the morning, how traffic jams form due to the interaction of many cars on the street, how traffic jams result in detours and increased the cost of commuters in terms of fuel, late arrival to the workspace, and so on. After the simulation, the individual agents in the multi-agent system will try to improve their travel plans this system just like we improve our trips from day to day in reality. The simulator will just go through this process again and again and again until the system reaches equilibrium. At this time, we know how the average people head out to work, and head back from the work, take lunch, and pick up kids, and so on.

This system evolution is defined by a sequence of events. If P is a person, and L is a location, which could be a road or a building, the events are $P L_1 \to P L_2$ with rate of happening being $c$. The events happen at different rates: If the road is long and the traffic is heavy, the rate of moving out of the road will be low; If one direction leads to a more popular place than another direction, there will be a higher probability for people to take the first direction. Again, we’re able to define the dynamics of people in terms of a stochastic process.

Now suppose that we can observe the trajectories of taxi cabs: whether a taxi cab moves slower than previously on a road, or whether there’re more taxi cabs on a road, and so on. We can proceed or engage reasoning in the same ways as we did before. Starting from the number and behavior of tracked vehicles in a road link, we can determine the total number of vehicles in the link by scaling and estimating traffic conditions. If we trace the origins and destinations of the estimated number of vehicles through the behavior of the simulator and fill any gaps with prior individual travel behaviors, we can extract information about the traffic at other road links. If we then iterate estimations between the traffic at links and the trip choices of simulated vehicles, we improve our estimation of both.

By observing how people move, we can infer not only road traffic, but also economical and social indices of large social systems. There are many sectors in our industry:  the education sector the service sector, the energy sector, the banking sector, and so on. People in different sectors move in different ways. Hence by studying how people move around and how people interact with each other, we might be able to estimate the GDP of a whole city, the regions with poverty, the distribution of different industrial sectors, and the formation of a city from the core city to a much larger city.

Nowadays data is ample, and the question is how you can engage the data in your modeling. For example, how can we infer the dynamic collaborator network from the DBLP data set, which is a publication database in the field of computer science? Why people collaborated in the past, who will collaborate in the future, and what will be the hot topics in a scientific field? For example, from the proximity data about computer science students captured with mobile phones, how do we figure out the typical paths through which cold and flu spread and organize campaigns to improve health? For example, from the D4D data set about the hourly locations and phone communications of 12 million people in Senegal, how can we make inferences about the road traffic, the big events at a specific places, the spreading of malaria, and so on.


The learning objective is that at the end of this semester, you will be able to simulate Markov processes. A Markov process is a process that, given the present, we do not have to worry about the past to make inferences about the future. You will be able to make inferences about these different Markov processes by combining simulation models with data, and you’ll be able to reason about dynamics of those processes: What is the average duration of a sunny day; What will be the first time for the prey population to be lower than some threshold? What will be the performance of a group? What is the probability for there to be a traffic jam at the certain places? What are the poorer areas, and what are the richer areas?

To achieve the learning objectives, we’ll talk about how do we simulate the Markov processes, and then we’ll talk about how do we make inference about Markov processes. We’ll talk about, we just mentioned exactly inference today, but most of the time, the processes are pretty complicated. We’ll have to use either MCMC, meaning a sampling or we will use some optimization technique to be able to find optimal approximate solution. Then we’ll talk about the properties of Markov processes, and then we will apply our knowledge to understand the different systems that you may encounter in the future. Most of the systems are systems about people. The tool that we will use in this course is R.

We plan to have four problem sets, and the first problem set will be out by the end of September the 11th, next Friday. I normally give two weeks for you to solve the problems. There will be four problem sets — But first, don’t worry, I think you will all get above B+; Second, if you have problems in finishing the problems, discuss with me. I think hands on experience is pretty important if we want to understand things. If we do not solve the problems.

At the end of the semester we will each submit a term paper, which could be either a literature review of some field that you are interested in related to machine learning and stochastic process, or some exploration about your own projects, or some preparation about the paper that you want to publish next.

Participation. In this semester, I will allocate 10% of the score in participation in the form of contributing  lecture notes to the course web log. I’ll finish the first few, you will do the rest, and I will put all notes together and share them with the whole class at the end of the semester, please register to the course web log so that you can contribute to lecture notes.

In summary, nowadays we have a lot of data, the industry is interested in big data, and academia is interested in big data as well. There are a lot of opportunities for us to understand social systems, chemical systems, biological systems, and so on from these data. The issue here is how we can engage those data and understand our systems and help people. In this course, we will study together on the mathematical tools that we might find useful in the future. I will see you in the next class.

Introduction [09/01/2015]