PS2 Solutions

Kalman Smoothing and Parameter Estimation

We work backward here.

& P(x_{k}|y_{1,\dots,T})\\
= & \sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,T})P(x_{k+1}|y_{1,\dots,T})\\
= & \overset{P(x_{k}|x_{k+1}y_{1,\dots,T})=\frac{P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}{\sum_{x_{k}}P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}=P(x_{k}|x_{k+1}y_{1,\dots,k})}{\sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,k})P(x_{k+1}|y_{1,\dots,T})}\\
= & \sum_{x_{k+1}}\frac{P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}{\sum_{x_{k}}P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}\cdot P(x_{k+1}|y_{1,\dots,T})

Hence to estimate the distribution of $x_{k}$ from $y_{1,\dots,T}$ in Kalman smoothing, we first need to find the joint distribution $P(x_{k+1}|x_{k})P(x_{k}|y_{1,\dots,k})$ just after the prediction step to estimate the distribution of $x_{k+1}$ in Kalman filtering, condition the distribution of $x_{k}$ from predicted value of $x_{k+1}$, join with the distribution of $x_{k+1}$ in Kalman smoothing, and marginalize over the distribution of $x_{k+1}$ in Kalman smoothing:

\end{matrix}\right) & \sim\mathcal{N}\left(\left(\begin{matrix}\hat{x}_{k|k}\\
\end{matrix}\right),\left(\begin{matrix}P_{k|k} & P_{k|k}F_{k+1}^{T}\\
F_{k+1}P_{k|k} & F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}
X_{k|k}|x_{k+1|k} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\
& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)\\
X_{k|k} & =\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right)+\epsilon\\
& \mbox{where }\epsilon\sim\mathcal{N}(0,P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k})\\
X_{k|T} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(\hat{x}_{k+1|T}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\
& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}P_{k|T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}+P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)


  • innovation in smoothing is $\delta=\left(\hat{x}_{k+1|T} -\hat{x}_{k+1|k}\right)$
  • Covariance of innovation is $P_{k+1|k}=F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}$
  • Gain in smoothing is $G=P_{k|k}F_{k+1}^{T}P_{k+1|k}^{-1}$
  • Updated (a posterior) state estimate $\hat{x}_{k|T}=\hat{x}_{k|k}+G\delta$
  • Updated (a posterior) covariance matrix $P_{k|T}=P_{k|k}+G(P_{k+1|T}-P_{k+1|k})G^T$

Parameter Learning

We will estimate $F$, $H$, $Q$ and $R$ from observations $z_1,\dots z_T$ about the following system

  • $X_t = F X_{t-1} + w_t$, where $w_k\sim{\cal N}(0, Q)$ are i.i.d.
  • $Z_t = H X_t + v_t$, where $v_k\sim{\cal N}(0, R)$ are i.i.d. and independent of $w_{t’}$ for all $t’$

Likelihood of $X_{1,\dots,T},Z_{1,\dots,T}$

=&\left(2\pi\right)^{-\frac{d_X}{2}}\left|\Sigma_{0}\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)\right)\\
&\prod_{t=2}^{T}\left(2\pi\right)^{-\frac{d_X}{2}}\left|Q\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)\right)\cdot\\
&\prod_{t=1}^{T}\left(2\pi\right)^{-\frac{d_Z}{2}}\left|R\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)\right)

Expected log likelihood observing $Z_{1,\dots,T}$

\mathbb{E}_{X_{1,\dots,T}}\log P(X_{1,\dots,T},Z_{1,\dots,T})
=& {\rm constant} {-\frac{1}{2}}\log\left|\Sigma_{0}\right|-{1\over 2}\mathbb{E}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)+\\
&\sum_{t=2}^{T}{-\frac{1}{2}}\log\left|Q\right|-{1\over 2}\mathbb{E}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)+\\
&\sum_{t=1}^{T}{-\frac{1}{2}}\log\left|R\right|-{1\over 2}\mathbb{E}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)

Estimating state transition model $F$ and observation model $H$

\frac{\partial}{\partial F_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&Q^{-\mbox{T}}\sum\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}=Q^{-\mbox{T}}F\sum\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}\\
\frac{\partial}{\partial H_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0\Rightarrow&H=\left(\sum_{t=1}^{T}Z_{t}\mathbb{E}X_{t}^{\mbox{T}}\right)\left(\sum_{t=1}^{T}\mathbb{E}X_{t}X_{t}^{\mbox{T}}\right)^{-1}

Solution: write down an equation involving
$\left(\matrix{\frac{\partial\log P}{\partial F_{11}} & \dots & \frac{\partial\log P}{\partial F_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial\log P}{\partial F_{d1}} & \dots & \frac{\partial\log P}{\partial F_{dd}} }\right)$
and solve $F$, where $d$ is the dimensionality of $X$. Apply the same procedure to solve $H$.

Estimating variance of process noise $Q$ and observation noise $R$

\frac{\partial}{\partial Q_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&\left(T-1\right)Q-\sum_{t=2}^{T}\left(\mathbb{E}X_{t}X_{t}^{\mbox{T}}-F\mathbb{E}X_{t-1}X_{t}^{\mbox{T}}-\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}F^{\mbox{T}}+F\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}F^{\mbox{T}}\right)\stackrel{\mbox{set}}{=}0\\
\frac{\partial}{\partial R_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&R=\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}Z_{t}Z_{t}^{\mbox{T}}-H\mathbb{E}X_{t}Z_{t}^{\mbox{T}}-Z_{t}\mathbb{E}X_{t}^{\mbox{T}}H^{\mbox{T}}+H\mathbb{E}X_{t}X_{t}^{\mbox{T}}H^{\mbox{T}}\right)

Solution: Note that the determinant of a matrix is $|A| = \sum_{\sigma\in S_n}{\rm sgn}(\sigma)\prod_{k=1}^d A_{k\sigma_k}$. Hence $\left(\matrix{ \frac{\partial|A|}{\partial A_{11}} & \dots & \frac{\partial|A|}{\partial A_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial|A|}{\partial A_{d1}} & \dots & \frac{\partial|A|}{\partial A_{dd}}}\right)\cdot A^{\mbox{T}}=|A|\cdot I$ and $\frac{\partial\log|A|}{\partial A}=\frac{|A|}{|A|}A^{-T}$.

Schur complement

Our goal is to find inverse of the covariance matrix $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)$. In order to do that, we first use the 2nd column $\left(\matrix{\Sigma_{12}\cr\Sigma_{22}}\right)$ to Gaussian-eliminate $\Sigma_{21}$, which is equivalent to right multiply the covariance matrix with $\left(\matrix{I & 0 \cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$. We then use the resulting 2nd row $\left(\matrix{0 & \Sigma_{22}}\right)$ to eliminate $\Sigma_{12}$, which is equivalent to left multiply the covariance matrix with $\left(\matrix{ I & \Sigma_{12}\Sigma_{22}^{-1} \cr 0 & I}\right)$. We performed the following operation.

& \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)
\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)
\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)\\
= & \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr & I}\right)
\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12}\cr 0 & \Sigma_{22}}\right)\\
= & \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right).

If we move $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$ and $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)$ to the right hand side of the equation above, then $\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right)$ and the resulting $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)^{-1}$ to the left hand side of the equation, we get the following.

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)^{-1}
= & \left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)
\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)
\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

Application of Schur complement to find the conditional probability density function of a multivariate normal distribution. Let $x = \left(\matrix{X_1\cr X_2}\right)\sim \mathcal{N}\left(\left(\matrix{\mu_1\cr\mu_2}\right), \left(\matrix{\Sigma_{11} &\Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right) \right)$ be multivariate normal distribution, where $X_1$ and $X_2$ are vectors of dimension $p$ and $q$ respectively, and we want to find the conditional distribution of $X_1$ on $X_2=x_2$.

The probability density function of $X$ is

& f\left(\matrix{x_{1}\cr x_{2}}\right)\\
\propto & \exp\left(-\frac{1}{2}
\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)
\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1}
\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\
= & \exp\left(-\frac{1}{2}
\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)
\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)
\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)
\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)
\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\
=& \exp\left(\left(\begin{matrix}\left(x_{1}-\mu_{1}\right)^{T}-\left(x_{2}-\mu_{2}\right)^{T}\Sigma_{22}^{-1}\Sigma_{21} & \left(x_{2}-\mu_{2}\right)^{T}\end{matrix}\right)
\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)
\left(\matrix{x_{1}-\mu_{1}-\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right)\cr x_{2}-\mu_{2}}\right)\right).

Hence $X_{1}|X_{2}=x_{2}\sim\mathcal{N}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$.


Duality between maximum entropy / minimum divergence and maximum likelihood

1. Convexity of $f^∗(y)=\sup_x(y^\mbox{T}⋅x−f(x))$: We want to show that $f^*(\alpha_1*y_1+\alpha_2*y_2)\le \alpha_1*f^*(y_1)+\alpha_2*f^*(y_2)$ for any $0\le\alpha_1,\alpha_2\le 1$ with $\alpha_1+\alpha_2=1$, $y_1$ and $y_2$. In fact we have

f^{*}(\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2}) & = & \sup_{x}((\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2})^{\mbox{T}}x-f(x))\\
& = & \sup_{x}(\alpha_{1}(y_{1}^{\mbox{T}}x-f(x))+\alpha_{2}(y_{2}^{\mbox{T}}x-f(x)))\\
& \le & \alpha_{1}\sup_{x_{1}}((y_{1}^{\mbox{T}}x-f(x)))+\alpha_{2}\sup_{x_{2}}((y_{2}^{\mbox{T}}x-f(x)))\\
& = & \alpha_{1}f^{*}(y_{1})+\alpha_{2}f^{*}(y_{2})\\

2. If $f(x)$ is differentiable in $x$, the supremum $\sup_x y^\mbox{T}\cdot x – f(x)$ is attained at $x^*$ with ${\partial\over\partial x}=0$, i.e., $y-f'(x^*)=0$ or $y$ is the derivative of $f$ at point $x^*$. By definition of supremum, $y^\mbox{T}\cdot x-f(x)\le f^*(y)$. We call $(x^*, f'(x^*))$ Legendre conjugate pair.

3. Double Legendre-Fenchel dual $f^{**}(x)$ as the Legendre-Fenchel dual of $f^*(y)$ is convex as the result of 1. The double Legendre-Fenchel dual of a convex function is itself. Suppose $g(x)\ge f^{**}(x)$ is a convex function below $f$. Then $\max(f^{**},g)$ is a convex function below $f$. Hence double Legendre-Fenchel dual of a function is the maximal convex function below.

4. Let $x=(x_1,x_2,\cdots)$ be the mean parameters of categorical distribution. Then $\phi(x)=x^\mbox{T}\cdot\log x=(\log x)^\mbox{T} \cdot x$ is the entropy of this distribution, $\triangledown \phi(x) = 1+\log(x)$, and
d_{\phi}(y,x) & = & \phi(y)-\phi(x)-\triangledown\phi(x)\cdot(y-x)\\
& = & y^{\mbox{T}}\cdot\log y-x^{\mbox{T}}\cdot\log x-(1+\log x)^{\mbox{T}}(y-x)\\
& = & y^{\mbox{T}}\cdot\log y-y^{\mbox{T}}\cdot\log x\\
& = & d_{\mbox{KL}}(x,y)

5. Since $\phi$ is convex, $\phi(\alpha\cdot x+(1-\alpha)\cdot y)\le\alpha\phi(x)+(1-\alpha)\phi(y)$ for $0\le \alpha\le 1$, or
\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{(1-\alpha)(\phi(y)-\phi(x))}{(1-\alpha)(y-x)}=\frac{\phi(y)-\phi(x)}{(y-x)},\\
\phi'(x)=\lim_{\alpha\to1}\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{\phi(y)-\phi(x)}{(y-x)},\\
d_{\phi}(y,x)=\phi(y)-\phi(x)-\phi'(x)\cdot(y-x) & \ge & 0.

6. For $0\le \alpha_1,\alpha_2\le 1$ and $\alpha_1+\alpha_2=1$,
d_{\phi}(\alpha_{1}y_{1}+\alpha_{2}y_{2},x) & = & \phi(\alpha_{1}y_{1}+\alpha_{2}y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\
& \le & \alpha_{1}\phi(y_{1})+\alpha_{2}\phi(y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\
& = & \alpha_{1}\left(\phi(y_{1})-\phi(x)-\phi'(x)(y_{1}-x)\right)+\alpha_{2}\left(\left(\phi(y_{2})-\phi(x)-\phi'(x)(y_{2}-x)\right)\right)\\
& = & \alpha_{1}d_{\phi}(x,y_{1})+\alpha_{2}d_{\phi}(x,y_{2}).

d_{\phi_{1}+\phi_{2}}(x,y) & = & (\phi_{1}+\phi_{2})(y)-(\phi_{1}+\phi_{2})(x)-(\phi_{1}+\phi_{2})(x)\cdot(y-x)\\
& = & \cdots=d_{\phi_{1}}(x,y)+d_{\phi_{2}}(x,y)

8. $\{x|d_\phi(x,u)=d_\phi(x,v)\}$ is a hyper plane: $d_{\phi}(x,u)=\phi(x)-\phi(u)+\phi'(u)(x-u)=\phi(x)-\phi(v)+\phi'(v)(x-v)=d_{\phi}(x,v)$. Rearranging the terms $\phi(u)-\phi(v)+\phi'(u)\cdot u-\phi'(v)\cdot v-(\phi'(u)-\phi'(v))\cdot x=0$.

9. We expand the expectationon $\mbox{E}_{X}d_{\phi}(X,u)=\mbox{E}\left(\phi(X)-\phi(u)-\phi'(u)(X-u)\right)=\mbox{E}\phi-\phi(u)-\phi'(u)(\mbox{E}X-u)$, and set its partial derivative over $u$ to be 0: $\frac{\partial}{\partial u}\mbox{E}_{X}d_{\phi}(X,u)=-\phi'(u)-\phi”(u)(\mbox{E}X-u)+\phi'(u)=0$. Since $\phi$ is convex, $\phi”\ge0$ and $u=\mbox{E}X$ minimizes expected K-L divergence.

10. $\psi(\theta)$ is convex-conjugate of $\phi(x)$, and $(\theta_1, x_1)$ and $(\theta_2,x_2)$ are conjugate pairs.

d_{\phi}(x_{2},x_{1}) & = & \phi(x_{2})-\phi(x_{1})-\phi'(x_{1})(x_{2}-x_{1})\\
& = & \theta_{2}^{\mbox{T}}\cdot x_{2}-\psi(\theta_{2})-\left(\theta_{1}^{\mbox{T}}\cdot x_{1}-\psi(\theta_{1})\right)-\theta_{1}(x_{2}-x_{1})\\
& = & \psi(\theta_{1})-\psi(\theta_{2})-x_{2}^{\mbox{T}}\left(\theta_{2}-\theta_{1}\right)\\
& = & d_{\psi}\left(\theta_{1},\theta_{2}\right).


H(q)-\sum_{i}\alpha_{i}\left(E_{q}(f_{i})-E_{\tilde{p}}(f_{i})\right) & \leftarrow & \sum_{k}q_{k}\log q_{k}-\sum_{i}\alpha_{i}\left(\sum_{k}f_{i}(x_{k})q_{k}-E_{\tilde{p}}(f_{i})\right)\\
\frac{\partial}{\partial q_{k}} & = & 1+\log q_{k}-\sum_{i}\alpha_{i}f_{i}(x_{k})\stackrel{\mbox{set}}{=}0\\
\Rightarrow q_{k} & \propto & \exp\left(\sum_{i}\alpha_{i}f_{i}(x_{k})\right)\\
& & \mbox{take to the limit.}

12. Given $q(x)=\exp(\theta\cdot g(x)-\Psi(\theta))$, $\log q(x)=\theta\cdot g(x)-\Psi(\theta)$. Setting partial derivative of log likelihood over parameter to 0: $\frac{\partial}{\partial\theta}\sum_{k}(\theta\cdot g(x^{(k)})-\Psi(\theta))=\sum_{k}g(x^{(k)})-N\cdot\frac{\partial}{\partial\theta}\Psi(\theta)\stackrel{\mbox{set}}{=}0$, we get $\mu=\frac{1}{N}\sum_{k=1}^{N}g(x^{(k)})$.

13. To find $\Psi^{*}(\mu)=\sup_{\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)$, we take the derivative of $\mu\cdot\theta-\Psi(\theta)$ over $\theta$: $$\frac{\partial}{\partial\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)=\mu-\underbrace{\frac{\sum_{x}f(x)\cdot\exp(\theta\cdot f(x))}{\sum_{x}\exp\left(\theta\cdot f(x)\right)}}_{\triangledown_{\theta}\Psi=\mbox{E}(f(x))=\mu}\stackrel{\mbox{set}}{=}0.$$
Hence $\mu=\triangledown_{\theta}\Psi$, and $$\Psi^{*}(\mu)=\theta\cdot\triangledown_{\theta}\Psi-\Psi(\theta)=\mbox{E}(\frac{\theta\cdot f(x)}{\Psi(\theta)})=\sum_{x}p(x;\theta)\log p(x;\theta)=-H(\theta).$$

Sample from finite state continuous time Markov process

1. There are 2 events in this system: weather change from sunny to rainy, and from rainy to sunny. Let the rate of the first event be $\alpha$ per day, and the rate of the latter event be $\beta$ per day. The number of events in $\Delta t$ days will be $\alpha\cdot\Delta t$ and $\beta\cdot\Delta t$ for $\Delta t$ small enough. This system is an inhomogeneous Poisson process.

Using first order approximation, the state transition kernel of one-day and one-hour is the following where $\Delta t$ takes value 1 and $1\over 24$ respectively,
\end{matrix}\left(\begin{matrix}1-\alpha & \alpha\\
\beta & 1-\beta
\end{matrix}\right)\cdot\Delta t.$$

Refer lecture slides of 09/01 for sampling in R.

2. Refer lecture slides of 09/01 for EM in R.

3. If we do not have observations, we take $P(Y_t|X_t)=1$. In other words, we do not update latent state distribution.

4. Refer stochastic kinetic model algorithm of 10/01/2015.

PS2 Solutions

Brownian Motion Process: Definition and Properties [11/05]

Prepared by Le Fang, Simranjot Singh and Devesh Mulmule

1. Introduction

Brownian motion is the main process for the calculus of continuous processes, which was described by Botanist R. Brown as the motion of a pollen particle suspended in fluid in 1828. It was observed that a particle moved in an irregular, random fashion. A. Einstein, in 1905, argued that the movement is due to bombardment of the particle by the molecules of the fluid, he obtained the equations for Brownian motion. In 1900, L. Bachelier used the Brownian motion as a model for movement of stock prices in his mathematical theory of speculation. The mathematical foundation for Brownian motion as a stochastic process was done by N. Wiener in 1931, and this process is also called the Wiener process.

The Brownian Motion process B(t) serves as a basic model for the cumulative effect of pure noise. If B(t) denotes the position of a particle at time t, then the displacement B(t) − B(0) is the effect of the purely random bombardment by the molecules of the fluid, or the effect of noise over time t.

2. Defining Properties of Brownian Motion

Brownian motion {B(t)} is a stochastic process with the following properties.

  1. (Independence of increments) B(t) − B(s) for t > s is independent of the past, that is, independent of $ B(u) , 0 \leq u \leq s$, or of $ {F (s)}$ , the ${\sigma-algebra}$ generated by ${B(u), u \leq s}$.
  2. (Normal increments) ${B(t) − B(s)}$ has Normal distribution with mean $ {0}$ and variance ${t − s}$. This implies taking $ {s = 0}$ that $ {B(t) − B(0)}$ has $ {N (0, t)}$ distribution.
  3. (Continuity of paths) $ {B(t), t ≥ 0}$ are continuous functions of $ {t}$.

Sometimes, another version of Brownian Motion, ${W(t)=B(t+s)-B(s)}$, is used, as so called “Wiener Process” or “Standard Brownian Motion”.

3. State Transition Kernel of Brownian Motion
3.1. Transition Probability Functions

If the process is started at $ {x}$, $ {B(0) = x}$, then $ {B(t)}$ has the $ {N (x, t)}$ distribution. More generally, the conditional distribution of $ { B(t + s)}$ given that $ {B(s) = x}$ is $ {N (x, t)}$. The transition function $ {P (y, t, x, s)}$ is the cumulative distribution function of this distribution,

$ \displaystyle P (y, t, x, s) = P(B(t + s) ≤ y|B(s) = x) = P_x (B(t) ≤ y)\ \ \ \ \ (1)$

The density function of this distribution is the transition probability density function of Brownian motion,

$ \displaystyle p_t (x,y) = (1/\sqrt{2\pi t}) e^{-{(y-x)^2/2t}}\ \ \ \ \ (2)$

The finite-dimensional distributions can be computed with the help of the transition probability density function, by using independence of increments.

$ \displaystyle \begin{split}P_x (B(t_1 ) ≤ x_1 , B(t_2 ) ≤ x_2 , . . . , B(t_n ) ≤ x_n ) = \\ \int_{-\infty}^{x_1}p_{t_1}(x,y_1)d{y_1} \int_{-\infty}^{x_2}p_{{t_2}-{t_1}}(y_1,y_2)d{y_2}…\int_{-\infty}^{x_n}p_{{t_n}-{t_{n-1}}}(y_{n-1},y_n)d{y_n}\end{split}\ \ \ \ \ (3)$

3.2. Space and time Homogeneity

It is easy to see that the one-dimensional distributions of Brownian motion satisfy $ {P_0 (B(t) ∈ A) = P_x (B(t) ∈ x + A)}$, where A is an interval on the line. All finite-dimensional distributions of $ {B_x (t)}$ and $ {x + B_0 (t)}$ are the same. Thus $ {B_x (t) − x}$ is Brownian motion started at $ {0}$, and $ {B_0 (t) + x}$ is Brownian motion started at x, in other words

$ \displaystyle B^x (t) = x + B^0(t)\ \ \ \ \ (4)$

This property is called the space homogeneous property of Brownian Motion. Generally, a stochastic process is called space-homogeneous if its finite- dimensional distributions do not change with a shift in space, namely if
$ {P(X(t_1 ) ≤ x_1 , X(t_2 ) ≤ x_2 , . . . X(t_n ) ≤ x_n |X(0) = 0) \\= P(X(t_1 ) ≤ x_1 + x, X(t_2 ) ≤ x_2 + x, . . . X(t_n ) ≤ x_n + x|X(0) = x)}$.

On the other hand, Brownian motion also bears the time homogeneous property as below.

$ {P(B(t) ≤ y |B(s) = x) = P(B(t-s) ≤ y|B(0) = x)}$

Actually, all finite-dimensional distributions of Brownian motion are time homogeneous.

Four realizations of Brownian motion $ {B = B(t)}$ started at 0 are exhibited in Figure 1. Although it is a process governed by the pure chance with zero mean, it has regions where motion looks like it has “trends”.

Figure 1: Four realizations or paths of Brownian motion B(t).
4. Brownian Motion as a Gaussian Process

The covariance function of the process X(t) is defined by

$ \displaystyle \begin{split}\gamma(s, t) = Cov (X(t), X(s)) \\= E (X(t) − EX(t)) (X(s) − EX(s)) \\= E (X(t)X(s)) − EX(t)EX(s)\end{split}\ \ \ \ \ (5)$

Theorem: A Brownian motion is a Gaussian process with zero mean function, and covariance function $ {min(t, s)}$. Conversely, a Gaussian process with zero mean function, and covariance function $ {min(t, s)}$ is a Brownian motion.

Proof: Since the mean of the Brownian motion is zero,
$ {\gamma(s, t) = Cov (B(t), B(s)) = E (B(t)B(s))}$.
If ${t < s}$ then ${B(s)= B(t) + B(s) − B(t)}$, and

$E(B(t), B(s)) = EB^2(t) + E(B(t)(B(s) − B(t)))= EB^2(t) = t$,

where we used independence of increments property. Similarly if $t > s$, $E(B(t)B(s)) = min(t, s)$.

To show the converse, let $t$ be arbitrary and $s ≥ 0$. $X(t)$ is a Gaussian process, thus the joint distribution of $X(t), X(t + s)$ is a bivariate Normal, and by conditions has zero mean. Therefore the vector $\displaystyle (X(t), X(t+s)−X(t)$  is also bivariate Normal. The variables $X(t)$ and $X(t + s) − X(t)$ are uncorrelated, using that $Cov(X(t), X(t + s)) = min(t, s)$, $Cov(X(t), X(t+s)−X(t)) = Cov(X(t), X(t+s))−Cov(X(t), X(t)) = t−t = 0$.

A property of the multivariate Normal distribution implies that these variables are independent. Thus the increment $X(t + s) − X(t)$ is independent of $X(t)$ and has $N (0, s)$ distribution. Therefore it is a Brownian motion.

Example 4.1 Find the distribution of $\int_{0}^{1} B(t) dt$

To determine the distribution of:

$ \displaystyle \int_{0}^{1} B(t) dt$

We can get this by taking limit of distribution of approximating sums

$ \displaystyle \sum B(t_{i})\Delta_{i}$, where $ {\Delta_{i} = t_{i+1} – t_{i}}$

If, for example, $ {t_{i} = i/n}$, then for $ {n=4}$ the approximating sum is $\displaystyle {1/4 (B(1/4) + B(1/2) + B(3/4) + B(1))}$. The distribution of above expression was found to be normal variable with mean 0 and variance 15/32. Similarly, distribution of all of the approximating sums is Normal with zero mean. Also, it can be shown that the limit of Gaussian distributions is a Gaussian distribution. Thus, the mean of required distribution is zero. Its variance is calculated as follows:

$ \displaystyle Var(\int_{0}^{1}B(t)dt) = Cov(\int_{0}^{1} B(t)dt, \int_{0}^{1}B(s)ds)$

$ \displaystyle =E(\int_{0}^{1} B(t)dt, \int_{0}^{1}B(s)ds)$

$ \displaystyle =\int_{0}^{1} \int_{0}^{1}E(B(t) B(s))dt ds$

$ \displaystyle =\int_{0}^{1} \int_{0}^{1}Cov(B(t) B(s))dt ds$

$ \displaystyle =\int_{0}^{1} \int_{0}^{1}min(t,s)dt ds$

$ \displaystyle =1/3$

5.Properties of Brownian Motion Paths
5.1 Quadratic Variation of Brownian Motion

The quadratic variation of Brownian motion Q(t) is defined as:

$ \displaystyle Q(t)=Q([0,t])=lim\sum_{i=1}^{n}|B(t_i^{n})-B(t_{i-1}^{n})|^{2}$

where the limit is taken over all shrinking partitions of [0, t], with

$ \displaystyle \delta_n = max_i (t_{i+1}^{n}-t_i^n )$

becomes 0 as n approaches $ {\infty}$ .

Let the sequence of partitions be such that interval is divide into two and the sub-interval also divided to two and so on.

Let $ {T_n = \sum_{i=1}^{n}|B(t_i^{n})-B(t_{i-1}^{n})|^{2}}$

$ \displaystyle E(T_n) = E(\sum_{i=1}^{n}|B(t_i^{n})-B(t_{i-1}^{n})|^{2})$

$ \displaystyle =\sum_{i=1}^n (t_i^{n} – t_{i-1}^{n})$

$ \displaystyle = t-0 = t$


$ \displaystyle Var(T_n) = Var(\sum_{i}|B(t_i^{n})-B(t_{i-1}^{n})|^{2})$

$ \displaystyle = \sum_{i}Var(B(t_i^{n})-B(t_{i-1}^{n}))^{2}$

$ \displaystyle = \sum_{i}3(t_i^{n} – t_{i-1}^n)^{2}$

(since fourth moment of N(0,$ {\sigma^2}$) is $ {3\sigma^4}$)

$ \displaystyle \leq \sum_{i}3max(t_i^{n} – t_{i-1}^n)t$

$ \displaystyle =3t\delta_n$

Hence, $ {\sum_{i=1}^n Var(T_n) \leq \infty}$. Using monotone convergence theorem, we find $ {E(\sum_{i=1}^n (T_n-E(T_n))^{2} \leq \infty}$. This implies that the series inside the expectation converges almost surely. Hence its terms converge to zero, i.e. $ {T_n-E(T_n)}$ becomes 0 almost surely, consequently ${T_n}$ tends to t almost surely.

5.2 Properties of Brownian motion

B(t)’s as functions of t have the following properties. Almost every sample path B(t), $ {0 \leq t \leq T}$

1. is a continuous function of t;
2. is not monotone in any interval, no matter how small the interval is;
3. is not differentiable at any point;
4. has infinite variation on any interval, no matter how small it is;
5. has quadratic variation on [0, t] equal to t, for any t.

Properties 1 and 3 of Brownian motion paths state that although any B(t) is a continuous function of t, it has increments $ {\Delta B(t)}$, over a time interval $ {\Delta t}$, much larger than $ {\Delta t}$, as $ {\Delta t}$ tends to 0.

For property 4, a continuous function with finite variation has 0 quadratic variation.

For property 2, a monotone function has finite variation.

6. Three Martingales of Brownian motion

A stochastic process {X(t), $ {t \geq 0}$} is a martingale if
1) for any t it is integrable, $ {E|X(t)| < \infty}$,
2) for any $ {s > 0}$,

$ \displaystyle E(X(t + s)|F(t)) = X(t)$

where the F(t) is the information up-till time t and equality holds almost surely. In other words, if we have the values up-till time t and $ {X(t)=x}$, then expected value of future values is x.

Let B(t) be any Brownian motion. Then

1. B(t) is a martingale.
2. $ {B(t)^2-t}$ is a martingale.
3. For any u, $ {e^{uB(t)-(u^2/2)t}}$ is a martingale.

Proof of the above properties:

Note: The key idea to establish the martingale property of Brownian motion is using the independent increment property, which implies

$\displaystyle E(g(B_{t+s}-B_t)|F_t)=E(g(B_{t+s}-B_t))$. Here take the proof of second martingale above as an example.

$ \displaystyle B(t+s)^2=(B(t)+B(t + s)-B(t))^2$

$ \displaystyle =B(t)^2+2B(t)(B(t+s)- B(t))+(B(t+s)-B(t))^2$

Now, $ {E(B(t+s)^2|F(t)))}$

$ \displaystyle =B(t)^2+2E(B(t)(B(t + s)- B(t))|F(t))+E(B(t+s)-B(t))^2|F(t))$

$ \displaystyle =B(t)^2+s$

Subtracting (t+s), we get $ {B(t)^2-t}$ is a martingale.

7. Markov Property of Brownian Motion & Stopping time
7.1 Stopping Time

A random time $T$ is called a stopping time for $B(t)$, where $t\ge 0$, if for any $t$ it is possible to decide whether $T$ has occurred or not by observing $B(s)$ for $0\le s\le t$. More rigorously, for any $t$ the sets $\{T\le t\}\in F_t$, the $\sigma$-algebra generated by $B(s)$ for $0\le s\le t$.


– Any non-random time $T$ is a stopping time. $\{T\le t\}$ is either $\emptyset$ or $\Omega$.
– The first time B(t) takes value 1, $\min_t B(t)=1$, is stopping time.

$ \displaystyle {\{ T\le t\}= \{B(u)<1}$ , for all $ u\le t\}\in F_t $.
– The time Brownian motion reaches maximum in $[0,1]$ is not stopping time.
– The time Brownian motion last crosses 0 is not stopping time.

7.2 Markov property and strong Markov property

Brownian motion has Markov property:

$\displaystyle P(B(t+s)\le y|F_t)=P(B(t+s)\le y|B(t))$

Proof: use exponential Martingale:

$ E(\exp(u\cdot B(t+s))|F_t) = \exp(u\cdot B(t))\cdot \mathbf E(\exp(u(B(t+s)-B(t)))|F_t)
= \exp(u\cdot B(t))\cdot \mathbf E(\exp(u(B(t+s)-B(t))))$

Brownian motion has strong Markov property: for any finite stopping time $T$, $\displaystyle P(B_{T+t}|F_T)=P(B_{T+t}|B_T)$.

The strong Markov property is similar to the Markov property, except that in the definition a fixed time t is replaced by a stopping time T. The proof of the strong Markov property can be done using the exponential martingale and the Optimal Stopping Theorem.

8. Hitting Time and Exit Time

Hitting time $\tau_x$ denotes the first time $B(t)$ hits level $x$: $\tau_x=\inf\{t>0:B(t)=x\}$
Exit time $\tau$ denotes the first time $B(t)$ exit an interval $(a,b)$: $\displaystyle \tau=\min(\tau_a,\tau_b)$

Claim: Consider a Brownian motion process $B_x(t)$ starting from $a<x<b$ and an exit time $\tau=\min(\tau_a,\tau_b)$. We have $P_x(\tau<\infty)=1$ and $E_x\tau<\infty$

Corollary: $P_a(\tau_b<\infty)=1$ and $P_a(\tau_a<\infty)=1$.

8.1 Distribution of Hitting Time

Let the maximum and minimun of Brownian motion $B_0(s)$ starting from 0 on $[0,t]$ be $M(t)=\max_{0\le s\le t} B(s)$ and $m(t)=\min_{0\le s\le t} B(s)$.

Let $\tau_x=\inf\{t>0: B(t)=x\}$. Then for any $x>0$, we have $\displaystyle P_0(M(t)\ge x)=2\cdot P_0(B(t)\ge x)=2(1-\Phi({x\over\sqrt t}))$.

Proof: Note the equivalence of two events: $\{M(t)\ge x\}=\{\tau_x\le t\}$, and that $P(B(t)\ge x)=P(B(\tau_x+(t-\tau_x))-B(\tau_x)\ge 0, \tau_x\le t)=P(\hat B(t-\tau_x)>0)\cdot P(M(t)\ge x)$ by strong Markov property.

Corollary: $P(B(t)\le 0, \forall 0\le t\le 1)=0$

The probability density of $\tau_x$ is inverse gamma density $\displaystyle f_{\tau_x}={|x|\over\sqrt{2\pi}}t^{-{1\over2}}\exp(-{x^2\over 2t})$, and $E \tau_x=\infty$. Hence although $x$ is visited with probability 1, the expected time for $x$ to be visited is infinite.

Proof: Note the equivalence of two events: $\{M(t)\ge x\}=\{\tau_x\le t\}$. Hence $P(\tau_x\le t)=P(M(t)\ge x)=2P(B(t)\ge x)=\dots$. Differentiate over $t$.

9 Reflection Principle

Let $T$ be a stopping time. Define $\hat B(t)=B(t)$ for $t\le T$ and $\hat B(t)=2B(T)-B(t)$ for $t\ge T$. Then $\hat B$ is also a Brownian motion. Below, Figure 2 shows this principle.

Figure 2: Reflection Principle of Brownian Motion

Fima C. Klebaner (2005). Introduction to Stochastic Calculus with Applications. Imperial College Press.

Peter Tankov and Rama Cont (2003). Financial Modelling with Jump Processes. Chapman & Hall/CRC Financial Mathematics Series. On Course web site

Brownian Motion Process: Definition and Properties [11/05]

Brownian Motion Calculus: Defining Integration & Differential of Brownion Motion [11/10]

Ito Integral on a Non-Random Simple Process

Let us take an example. Let’s consider a time period [0, T ] partitioned into n intervals of equal length $\Delta t = T/n$ with endpoints $ t_{k}= k\Delta t,$ k = 0, …, n. These are the times at which the market is open and trade in shares can take place. An investor buys $q(t_{0})$ shares at time 0 at a price of $S(t_{0})$ each. At time $t_{1}$, market trading establishes a new share price $S(t_{1})$. Once this price has been revealed the investor can change the quantity of shares held, from $q(t_{0})$ to $q(t_{1})$. The same at the subsequent times $t_{2}$ through $t_{n−1}$. At time $t_{n}$ the entire portfolio is liquidated as soon as share price $S(t_{n})$ becomes known. A portfolio of $q(t_{k})$ shares is held from just after the share price $S(t_{k})$ has been revealed, to when the share price $S(t_{k+1})$ at time $t_{k+1}$ becomes known. This portfolio is worth $q(t_{k} )S(t_{k})$ at time $t_{k}$ and $q(t_{k})S(t_{k+1})$ at time $t_{k+1}$. So from time $t_{k}$ to time $t_{k+1}$, with the change in share price $S(t_{k+1})$ − $S(t_{k})$ denoted by $S_{k}$, the portfolio value changes by $q(t_{k}) S_{k}$. The gain over all trades can then be expressed as $I_n = \sum_{k=0}^{n-1} q(t_{k}) \Delta S_{k}$. This expression is an example of a discrete stochastic integral.

At each trading time, the quantity of shares to be held until the next trading opportunity arises, is decided by the investor. Some investors will buy shares and others will sell as a trade requires two parties. The factors which an investor takes into account in determining the shareholdings are unknown, so q is a discrete-time random process. Thus the stochastic integral is constructed from the two random processes q and S:

$$\sum_{k=0}^{n-1} q(t_k )[ \mu S(t_k) \Delta t + \sigma_S(t_k ) \Delta B(t_k )]
= I_n$$

Increasing n makes the time step t smaller and discrete-time trading approaches continuous-time trading. The first term is a summation with respect to time step $\Delta t$. The second term is a summation with respect to Brownian motion increment [B($t_{k+1}$) − B($t_{k}$ )].

Let’s take time period [0, T] partitioned into n intervals of equal length t = T/n, with endpoints $t_{k}$ = $k\Delta t$, k = 0, . . ., n. The simplest integrand f is a step-function whose values are non-random, so the only random quantity is the Brownian motion integrator. Step-function f can be written with indicator notation as
$$f = f(t_{0}) 1_{[t_0,t_1)} + · · · + f (t_{k} ) 1_{[t_k ,t_{k+1})} + · · · + f (t_{n−1}) 1_{[t_{n−1},t_n )}$$

where $1_{[tk ,tk+1)}$ denotes the indicator function which has value 1 on the interval shown in the subscript, and 0 elsewhere. Define the discrete stochastic integral $I_{n}( f )$:

$$I_n( f ) = \sum_{k=0}^{n-1}f(t_{k} )[B(t_{k+1}) − B(t_{k} )]$$

Therefore, integral of a non-random process over Brownian motion is the limit of the integrals of non-random simple processes approaching this non-random process.

Ito Integral on a Simple Adapted Process

Let f be at random level $f(t_{k}, ω)$ for $t_{k}$ < t < $t_{k+1}.$ When the history of the Brownian motion process becomes known progressively at each time $t_{k}$, it must be possible to determine $f (t_{k})$ from this history alone. The current terminology is that $f (t_{k})$ must be adapted to the filtration ($t_{k}$ ).

  • Linearity:$\int_0^T (\alpha X(t) +\beta Y(t)) d B(t) = \alpha \int_0^T X(t) d B(t) + \beta \int_0^T Y(t) d B(t)$ for constant $\alpha$ and $\beta$ and simple adapted processes $X(t)$ and $Y(t)$
    $\int_0^T 1_{(a,b]} d B(t) = B(b)-B(a)$, and
    $\int_0^T 1_{(a,b]} X(t) d B(t) = \int_a^b X(t) d B(t)$
  • Zero mean property
    $\mathbf E\int_0^T X(t) d B(t) = 0$
  • Isometry property:
    $\mathbf E\left(\int_0^T X(t) d B(t)\right)^2=\int_0^T \mathbf E(X^2(t)) dt$

Ito Integral of Regular Adapted Processes

After looking at the specific case of simple adaptive process, we want to find the Ito integral for general integrands i.e. any regular adapted process. So let’s assume $X^n(t)$ to be a sequence of simple processes convergent in probability to the process $X(t)$. $$ X^{(n)}(t)\rightarrow X(t)$$ Then, under some conditions, the sequence of their integrals $\int_0^TX^n(t)dB(t)$ also convergence in probability. Hence, if, $$\int_0^TX^n(t)dB(t)\rightarrow J^n$$
Then $$\int_0^TX(t)dB(t)\rightarrow J$$
For Example:
Find $\int_0^T B(t)dB(t)$:
let’s assume $0=t_0^n<t_1^n<t_2^n< … <t_n^n = T$ be a partition of [0, T] and let- $$X^n(t) = \sum_{i=0}^{n-1} B(t_i^n)I_{(t_i^n, t_{i+1}^n)}(t)$$
Also $\delta_n = max_i(t_{i+1}^n – t_i^n)\rightarrow 0$ then $X_n(t)\rightarrow B(t)$ is approximately the continuity of the Brownian paths.
Hence, $$\int_0^TX^n(t)dB(t) = \sum_{i=0}^{n-1} B(t_i^n)(B(t_{i+1}^n)- B(t_{i+1}^n))$$
where, $$B(t_i^n)(B(t_{i+1}^n)- B(t_{i+1}^n))=1/2(B^2(t_{i+1}^n) – B^2(t_{i}^n) – (B(t_{i+1}^n) – B(t_{i}^n))^2)$$
Therefore, $$\int_0^TX^n(t)dB(t)$$
$$ = \sum_{i=0}^{n-1} 1/2(B^2(t_{i+1}^n) – B^2(t_{i}^n) – (B(t_{i+1}^n) – B(t_{i}^n))^2)$$
$$=1/2(B^2(T) – 1/2(B^2(0)) -1/2 \sum_{t=0}^{n-1}(B(t_{i+1}^n) – B(t_i^n))^2$$
Here, the last summation part converges to the quadratic variation T of Brownian motion on [0, T] in probability.
Therefore $\int_0^T B(t)dB(t) = lim \int_0^T X^n(t)dB(t) = 1/2B^2(T) – 1/2T$

Any regular adaptive process has the same properties as we already discussed for the simple adaptive process.

  • Linearity:
    $\int_0^T (\alpha X(t) +\beta Y(t)) d B(t) = \alpha \int_0^T X(t) d B(t) + \beta \int_0^T Y(t) d B(t)$
    for constant $\alpha$ and $\beta$
  • $\int_0^T 1_{(a,b]} X(t) d B(t) = \int_a^b X(t) d B(t)$
  • Zero mean property: $\mathbf E\int_0^T X(t) d B(t) = 0$
  • Isometry property: $\mathbf E\left(\int_0^T X(t) d B(t)\right)^2=\int_0^T \mathbf E(X^2(t)) dt$

Ito Integral Process


(Equation A)


(Equation B)

In simple terms, ito intergal is a way of integrating a stochastic process variable with respect to another stochastic process that result into a new stochastic process. Here, X(s) is a square-integrable process (hence the equation B) that is adapted to the filteration generated by B(s). Standard calculous doesn’t apply to such functions since the function X(S) is not differentiable at any point and has infinite variateion over every time interval.

Ito integral plays crucial part in mathemetical finance to evaluate strategies to make descisions over time variable stochastic processes or Brownian processes such as stoch prices.

  • Martingale Property: Probabilities of current state doesn’t depend on any knowledge of past events, given that it’s square is inegral is finite. I.e. Expected value of the state Y at time t doesn’t depend on the given $F_s$. And hence it is a stochastic variable Y(s) only.
    $$E(Y(t)|F_s) = Y(s)$$
  • Quadratic Variation: The quadratic variation of an Ito processes can be given by $\int_0^t X^2 (s)ds$

Ito’s Formula for Brownian motion

We know that B(T) is a Brownian motion from 0 to T and since we know the first order derivative and second order derivate of F then we can find out the value of Brownian motion at time T as in the example below.
Example: \[f(B(t)) = f(0)+\int_0^t f'(B(s)) d B(s) + {1\over 2}\int_0^t f”(B(s)) d s\]
where: \[\exp(B(t)) = 1+\int_0^t \exp(B(s)) d B(s) + {1\over 2}\int_0^t \exp(B(s)) d s\]
Proof of the above example is:
\[f(B(t)) = f(0) + \sum_{i=0}^{n-1} \left(f(B(t^n_{i+1}))-f(B(t^n_{i}))\right)\]
After taking the first order derivative,
\[\hphantom{f(B(t)) } = f(0) + \sum_{i=0}^{n-1} f'(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right) + {1\over 2}\sum_{i=0}^{n-1} f”(\theta^n_{i})\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\] For \[\theta^n_{i}\in \left(B(t^n_{i}),B(t^n_{i+1})\right)\]
\[\sum_{i=0}^{n-1} f'(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right) \to \int_0^t f'(B(s)) d B(s)\]
After taking the second order derivative and as $n\to\infty$ we have the below equations
\[\sum_{i=0}^{n-1} f”(\theta^n_{i})\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\to \sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\]
In probability as $n\to\infty$
\[\sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right)^2 \to \sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(t^n_{i+1}-t^n_{i}\right)\]
\[\sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\to \int_0^t f”(B(s)) d s\]

Ito’s Process and Stochastic Differentials

We know B(t) is a Brownian motion from $0<t<T$. An Ito-process Y(t) from 0≤t≤T is given by
\[Y(t) = Y(0) + \int_0^t \mu(s) d s + \int_0^t \sigma(s) d B(s)\] or
\[d Y(t) = \mu(t) d t + \sigma(t) d B(t)\]

where $0\le t\le T$, $Y(0)$ is $F_0$-measurable, $\mu(t)$ and $\sigma(t)$ are $F_t$-adapted, $\int_0^T|\mu(t)| d t<\infty$ and $\int_0^T\sigma^2(t) d t <\infty$.
Example: Taylor’s formula written at the second order for Brownian motion reads as:
Chain rule: $d(f(B(t)))=f'(B(t)) d B(t) + {1\over 2}f”(B(t)) d t$, for $f$ with continuous 2nd derivative:
$d \exp(B(t)) = \exp(B(t)) d B(t) + {1\over 2} \exp(B(t)) d t$
$d B^2(t) = 2 B(t) d B(t) + dt$

Ito’s Formular for Ito processes

Let X be an Ito process with stochastic differential then
$d X(t) = \mu(t) d t + \sigma(t) d B(t)$ for 0 ≤ t ≤ T.
Let ($f\in C^2$) then the stochastic differential of the process f exists and can be written as
df(X(t)) &=& f'(X(t))dX(t)+\frac{1}{2}f”(X(t))d[X,X](t)\\
&=& f'(X(t))dX(t)+\frac{1}{2}f”(X(t))\sigma^{2}(t)dt\\
&=& \left(f'(X(t))\mu(t)+\frac{1}{2}f”(X(t))\sigma^{2}(t)\right)dt+f'(X(t))\sigma(t)dB(t)

Ito’s Integral and Gaussian process

If x is a non- random process satisfying $\int_0^T X^2(s) d s <\infty$ and Y is a Gaussian process where $Y(t)=\int_0^t X(s) d B(s)$ Then it hass a zero mean and the covariance can be written as below:
$\mbox{Cov}(Y(t), Y(t+u)) = \int_0^t X^2(s) d s$, $t,u\ge 0$

Brownian Motion Calculus: Defining Integration & Differential of Brownion Motion [11/10]

The Theory of Speculation [11/12]

The Theory of Speculation

The focus of Bachelier’s work (The Theory of Speculation) was to establish the probability law for the price fluctuations that the market admits at this instant. Indeed, while the market does not foresee fluctuations, it considers which of them are more or less probable, and this probability can be evaluated mathematically.

There are innumerable influences on the movements of stock exchange.

Stock Exchange acts upon itself, implies: its current movement is a function not only of earlier fluctuations, but also of the present market position.

The determination of these fluctuations is subject to an infinite number of factors: it is therefore impossible to expect a mathematically exact forecast.



Smiley face
(An excellent translation of the original paper:


A few important terms defined

Spot Price: The current price at which a particular security can be bought or sold at a specified time and place. (A securities futures price is the expected value of the security.)

Futures Contract: A contractual agreement to buy or sell a particular commodity or financial instrument at a pre-determined price in the future.

Option: offers the buyer the right, but not the obligation, to buy (call) or sell (put) a security or other financial asset at an agreed-upon price (the strike price) on a specific date (exercise date).

Mathematical advantage: Mathematical expectation does not provide a coefficient representing, in some sense, the intrinsic value of the game, it only logically tells whether a game will be profitable or not, Hence the concept of Mathematical advantage.

We can understand mathematical advantage from the example of a gambler’s odds as explained in the paper by Louis Bachelier: Define the mathematical advantage of a gambler as the ratio of his positive expectation and the arithmetic sum of his positive and negative expectations. Mathematical advantage varies like probability from zero to one, it is equal to 1/2 when the game is fair.

Mathematical expectation of gain / (mathematical expectation of gain + absolution value of mathematical expectation of loss).

The mathematical expectation of the speculator is zero


The General Form of probability curve

The probability that a price y be quoted at a given epoch is a function of y.

The price considered by the market as the most likely is the current true price.

Prices can vary between −xo and +∞: xo being the current absolute price.

It will be assumed that it can vary between −∞ and +∞. The probability of a spread greater than xo being considered a priori entirely negligible.

The probability of a deviation from the true price is independent of the absolute level of this price, and that the probability curve is symmetrical with respect to the true price.

Only relative prices will matter because the origin of the coordinates will always correspond to the current true price.


Probability law of Financial Product

Method 1:

  • px,tdx: probability for price to be in the interval [x,x+dx) at time t
    px,tdx denotes the probability that, at epoch t, the price is to be found in the elementary interval x, x + dx.


The desired probability is therefore
p_{x,t_1}p_{z-x,t_2} dx dz



  • Markov property: pz,t1+t2dz = −∞px,t1pz−x,t2dxdz
    Because at epoch t1, the price could be located in any the intervals dx between −∞ and +∞, so this is the probability of the price z being quoted at epoch t1 + t2.
  • Homogeneous Gaussian process:

p_{x,t} = {1\over 2\pi k\sqrt t}\exp\left(-{1\over 4\pi k^2 t}\right)

Where \[k=\int_0^\infty p_{x,1} dx\] is the gain at time 1.

Method 2:  Alternate derivation of the Probability Law

  • conduct m Bernoulli trials in [0,t), with success rate p

Suppose that two complementary events A and B have the respective probabilities p and q = 1 – p. The probability that, on m occasions, it would produce α equal to A and m − α equal to B, is given by the expression.
\[{P}(\alpha; m,p) = {m\choose \alpha}p^\alpha q^{m-\alpha}\]


  • mathematical expectation of gain

\[ = pq\cdot\left(\partial_{p}\sum_{h\ge0}\mbox{P}(mp+h;m,p)-\partial_{q}\sum_{h\ge0}\mbox{P}(mp+h;m,p)\right) \]
\[= \frac{m!}{(mp)!(mq)!}p^{mp}q^{mq}mpq\]


The quantity h is called the spread.
Let us seek for the mathematical expectation for a gambler who would receive a sum equal to the spread whenever this spread be positive.
We have just seen that the probability of a spread h is the term from the expansion of (p + q)m in which the exponent of p is mp + h, and that of q is mqh.

To obtain the mathematical expectation corresponding to this term, it is necessary to multiply this probability by h. Now,


h = q(mp + h) − p(mqh)

mp + h and mqh being the exponents of p and q in a term from (p + q)m.


  • Apply Stirling’s formula
    \[n!=e^{-n}n^n\sqrt{2\pi n}\]

\[{E}={dh\over \sqrt{2\pi m p q}}\exp\left(-{h^2\over 2 m p q}\right)\]

The Theory of Speculation [11/12]

Cities and Complexity [10/27]

Prepared by Avinav Sharan, Himanshu Sharma and Gaurav Kshirsagar

Rank-Size Population Distributions And Dynamics of Cities

We analyse how the shift in population of a given city happens with time given their ranks. We try to rank them by their population size.\\\\
We define some parameters to represent the population and their ranks. \\

$P_{1}(t),…..,P_{N}(t)$: populations of N cities at time t
$r_{1},….,r_{N}$: population ranks of cities in decreasing
$P_{i,r_{i}=1}≥P_{j,r_{j}=2}≥P_{m,r_{m}=N}$: population sorted in decreasing order
$\Gamma(t+1)=\sum_{i}|p_{i,r_{i}(t+1)}(t+1) – p_{i,r_{i}(t)}(t)|$: percentage shift in population, where $pi,r=P_{i,r}/\sum_{i}P_{i,r}$
$\wedge(t+1)=\sum_{i}|r_{i}(t+1)-r_{i}(t)|/N$ percentage shift in rank order

insert image 1
insert image 2

We plotted the $\log(rank)$ vs $\log(population)$ for 458 standardized administrative areas from 1901 to 1991. We can see that it shows a power law distribution between rank and population. The curve is very stable as we can see from the plot. As 1901 and 1901 the curve is similar with same coefficient even with population of the cities changes.\\
How can we explain this above pattern based on microscopic agent behaviour and interaction. If we can explain then we can define the model for population development:\\

insert image 3

We will try to explain one of the multiple interpretation.
let increment in population is some random variable with Gaussian distribution.\\
If we have the initial state (population at time t = 0, $P_{i_{0}}$) and increment of population at time t = 0 is $\epsilon_{0}$ and for city 1 increment is $\epsilon_{1}$ and so on.

$\log P_{i}(t) = \log P_{i}(0) + \sum_{\tau=0}^t \epsilon_{i}(\tau)$

Also assume mean is $\mu$ and variance is $\sigma^2$, than we get the log normal distribution. (meaning logarithm of the population is normal distribution)

Now simulate the same number of cities, 458 cities and we start from a uniform population distribution. In each city we simulate the population growth according to Gaussian distribution. We sample the scaling factor of all of the 458 cities and we simulate from 991 AD to 1991 and what we find out is indeed that we have some kind of nice straight line of the log population shear versus log ranks. It is one feasible interpretation, hypothesis about city development.

Explaining city growth with Ants Model

How can we explain the above model (growth in population of cities)?

insert image 4

We use this agents based model to explain the growth of cities.
In the picture we have nest in the center, and three food sources. In starting ants move around randomly until they reach the food source. They carry the food and leave the chemical trails while returning for the other ants. So that they can find the food source easily. We can see the simulation of this model in the NetLogo tool.\\
We can identify many things with this model. First is that nearest food resource will be probabilistically explored first and furthest will be explored last.
Second, whenever ant find it and they will leave trails. And It will have large concentration of the chemicals. So we can relate this ant model with the behaviour of the people.

Formulation of Ants Model

Following random variables represent the Ants model:

Position of any ant at time t : $A(x_i(t),y_i(t))$
Home of ant i: $A(X_i,Y_i)$
Position of resource k : $R(X_k, Y_k)$
The chemical concentration of scent at point (x,y) : $\psi(x,y)$

There are few things to note in the above representation. ‘Home of ant’ does not vary with time. The chemical concentration of scent is a function of time.

When the ant is moving back to home with food, it leaves the trail of chemical scent. This is represented by the following equation:
\psi(x_i(t+1),y_i(t+1)) = \psi(x_i(t),y_i(t)) + \phi

This equation says that concentration of chemical scent depends on the same at the postion in previous time step, with some added noise $\phi$.

Generally, we can represent the ants motion as the following:
x_i(t+1) = x_i(t) + \Delta x_i(t) + \epsilon_x(t)
y_i(t+1) = y_i(t) + \Delta y_i(t) + \epsilon_y(t)
That is, the position of ant depends on the gradient of previous time step plus some added noise, $\epsilon$.
The gradient, $\Delta x$ or $\Delta y$ depends on the two cases:

When the ant is seeking food: We know that the ant is trying at random (given by $\epsilon(t)$, but following any chemical trail if it finds. This is intuitively represented by the equation:
(\Delta x_i(t), \Delta y_i(t)) = c \cdot \nabla \psi(x_i(t),y_i(t)) + \epsilon(t)

When the ant is moving food back: We know that ant follows the path that is dependent on the ant’s home. This is given by the equation:
(\Delta x_i(t), \Delta y_i(t)) = (f(X_i),f(Y_i))

This formulation let us think the ant model probabilistically.

Simulation and Observation of Ants Model

First we simulate a \textbf{simplified ants model, with one home (at origin) and one food center} (represented by a square box). In the image below we can observe three time stages. Intitially (at t=0), we observe that the ants move in random directions. There are no definite permanent tracks yet.\par
As we skip to time, t=200, we observe that ants have found the food and the track has started to emerge. But the ants are still randomly distributed in the space.\par
Finally, at time, t=2000, we observe that the track is well defined and the ants have started to cluster around this track.\linebreak

Insert image 5

Next we simulate an \textbf{ant model with random homes and multiple resources}. In the below image, we can clearly observe one of the main properties of cities, that is roads. We can observe star like structures (roads) between homes and resources after sufficient number of iterations.

Insert image 6

Next let us make this formulation more interesting and complex. We add following features instead of resources: ‘Industry’ ($I_{xy}$) and ‘Service’ ($S_{xy}$). We also dynamically add population ($P_{xy}$) to the model. ‘Service’ is dependent on ‘Industry’. The population is dependent of ‘Service’ and ‘Industry’. We can model any kind of interaction between Industry, Service and Population.

The process of the model is defined by:
The Position at any time, depends on previous time step population, the diffusion of population ($\nabla^2P_{xy}(t)$) (laplacian) at previous time step, the interaction of industry and service ($o_{xy}^{(k)}(t)$) and random noise ($\epsilon$).
P_{xy}(t+1) = P_{xy}(t) + \theta^{(P)}\nabla^2P_{xy}(t) + \Sigma_k o_{xy}^{(k)}(t) + \epsilon_{xy}^{(P)}(t)
where, $\theta$ is diffusion constant
Similarly, we can represent the denisty of Indurtrial resouce and Service:
I_{xy}(t+1) = I_{xy}(t) + \theta^{(I)}\nabla^2 I_{xy}(t) + \Sigma_k o_{xy}^{(k)}(t) + \epsilon_{xy}^{(I)}(t)
S_{xy}(t+1) = S_{xy}(t) + \theta^{(S)}\nabla^2 S_{xy}(t) + \Sigma_k o_{xy}^{(k)}(t) + \epsilon_{xy}^{(S)}(t)
In the image below we can observe the simulation. Intially, we see the formulation of nuclear cities (at t=100). We observe cluster of population which represent cities. The log-size of the ‘cities’ is linear with log-rank, which we had seen in the british cities data from 1901 to 1991. We also observe the formation of roads.

Insert image 7

SandPile model

Let us perform the following experiment:
Drop sand uniformally onto a table. Gradually, sandpile height increases, and the sand starts to fall in surrounding areas. The key observation is that sand fall can trigger more sand fall and an ‘avalanche’ will occur.
In Sandpile Model we try to model the aboce experiment. Model Representation:
X_{i,j} \geq 4: X_{i,j} = X_{i,j} – 4,
X_{i \pm 1,j} = X_{i \pm 1,j} + 1,
X_{i,j \pm 1} = X_{i,j \pm 1} + 1,
where $X_{i,j}$ is the height of the sand pile.

NetLogo Demo:

You can play with it, and observe that small avalanch occur more often than large ones. We study Sandpile model as it is related to many real-life scenarios. For instance, the migration of the population from one city to another can be said to follow sand pile model. Similarly, we can observe it in forest fires, frequency and size of earthquakes, etc.

Size of Avalanche based on SandPile model

Here we are using the SandPile model to understand the working of avalanches.
Attempts to model avalanche behaviour date from the early 20th century, notably the work of Professor Lagotala in preparation for the 1924 Winter Olympics in Chamonix. His method was developed by A. Voellmy and popularised following the publication in 1955 of his Ueber die Zerstoerungskraft von Lawinen (On the Destructive Force of Avalanches). Voellmy used a simple empirical formula, treating an avalanche as a sliding block of snow moving with a drag force that was proportional to the square of the speed of its flow:

Force = 1/2 * \rho \nu^2

Avalanche increases in the following manner having a relation to the previous state
A + \phi \rightarrow B
B + \phi \rightarrow C
C + \phi \rightarrow D
D + \phi \rightarrow p_1 D + \phi, p_2 C + 2\phi, B + 3\phi, A + 4\phi

The kinetic equations for this growth are:
\dot{n_A} = n_\phi (p_4 n_D – n_A)
\dot{n_B} = n_\phi (p_3 n_D + n_A – n_B)
\dot{n_C} = n_\phi (p_2 n_D + n_B – n_C)
\dot{n_D} = n_\phi (p_1 n_D + n_C – n_D)
In general,
\dot{n_\phi} = n_\phi (\bar{p} n_D – 1) + \bar{p}\nu\nabla^2 (n_\phi n_D) + \eta(r,t)
\bar{p} = p_1 + 2p_2 + 3p_3 + 4p_4
\nu = diffusion coefficient
\eta = noise

Insert image 8

Avalanches in 2D sandpile model.The dark regions show the size
of avalanches triggered by a single addition.

Though at first glance, it seems that the number of topplings and the area both measure the number of affected sites, a single site may topple more than once in a single avalanche, and hence the two are truly different variables. The duration of the avalanche is defined as the number of updates one must perform on the system before all the sites become stable after the addition of one grain of sand.

Video An avalanche triggered which follows SandPile model

Avalanche Equilibrium

For the avalanche to reach its equilibrium we have the probability of the number of grains affected after t steps
Probability generating function:
G_t(\lambda_A,\lambda_B,\lambda_C,\lambda_D) = \sum_{N} (N_A,N_B,N_C,N_D) \lambda^{N_A}_A \lambda^{N_B}_B \lambda^{N_B}_B \lambda^{N_D}_D

At time 1,
G_1 = (n_Dp_4)\lambda_A + (n_Dp_3 + n_A)\lambda_B + (n_Dp_2 + n_B)\lambda_C + (n_Dp_1 + n_C)\lambda_D
\end{equation} \linebreak
At time t+1, the recursive relationship with $G_t is

G_{t+1} = (n_A\lambda_B + n_B\lambda_C + n_C\lambda_D) (n_Dp_4)\lambda_A + n_D(p_1\lambda_DG_t + p_2\lambda_DG_t^2 + p_3\lambda_BG_t^3 + p_4\lambda_AG_t^4)

G_\infty(\lambda) = \frac{1-\sqrt{a-4\lambda^2n_D(1-n_D)}}{2\lambda n_D}

Example: Metroploitan Buffalo population growth (1750-1990)

Insert image 9

It has been discusseed and found that when a system diverges the various parameters of the Sandpile model exhibit a power law distribution for scenarios like avalanches or population. For additional information see this

For such a distribution in case of population, we have the following parameters:
N(R) = (2R)^D,
where $\textit{D}$ = fractal dimension of the city 1\leq\textit{D}\leq2
R = effective radius of city

Cumulative populaiton density:
P(R) = N(R)/A(R) = (2R)^{D-2}
where A(R) = Area of square

Incremental Populaion
n(R) = \frac{dN(R)}{dR} = D2R^{D-1}

Incremental density
\rho(R) = \frac{dn(R)}{dR} = D^2R^{D-2}

Cities and Complexity [10/27]