Let $X_{t}$ be the hidden states and $y_{t}$ be the observations of a discrete-time state-space model (Kalman filter and hidden Markov model) identified by a transition probability $P(X_{t+1}|X_{t})$ and an observation model $P(Y_{t}|X_{t})$, where $t=1,\cdots,T$. Particle filtering for making inferences about hidden states $X_{t}$ from observations $y_{1:t}$ is comprised of a predicting/sampling step to sample particle positionss $x_{t+1}^{1}\sim P(X_{t+1}|x_{t}^{i_{t}^{1}}),\cdots,x_{t+1}^{N}\sim P(X_{t+1}|x_{t}^{i_{N}^{1}})$ at time $t+1$ from the particles $x_{t}^{i_{t}^{1}},\cdots,x_{t}^{i_{t}^{N}}$ at time $t$, and an updating/resampling step to sample particle indices according to their likelihoods with respect to the observations $i_{t+1}^{1},\cdots,i_{t+1}^{N}\sim\mbox{Categorical}(P(y_{t+1}|x_{t+1}^{1}),\cdots,P(y_{t+1}|x_{t+1}^{N}))$, from the starting particle position $x_{0}^{1},\cdots,x_{0}^{N}=x_{0}$ and indices $i_{0}^{1}=1,\cdots,i_{0}^{N}=N$. The particle approximation of the filtering probability is $\hat{p}(x_{t}|y_{1,\cdots,t})=\frac{1}{N}\sum_{k}\delta(x_{t}^{i_{t}^{k}})\stackrel{N\to\infty}{\longrightarrow}p(x_{t}|y_{1,\cdots,t})$, and the particle approximation of the likelihood function is $\hat{p}(y_{1,\cdots,T})=\prod_{t}\hat{p}(y_{t}|y_{0,\cdots,t-1})=\prod_{t}\frac{1}{N}\sum_{k}\hat{p}(y_{t}|\xi_{t}^{k})\stackrel{N\to\infty}{\longrightarrow}p(y_{1,\cdots,T})$. The conditional probability of particle locations and indices from particle filtering $\psi^{N}$; the approximate likelihood of the of the particle locations $Z_n^N$ are given in the following. \begin{eqnarray} Z_{n}^{N} = \hat P(y_0,\cdots,y_n) & = & \left[{1\over N}\sum_{l=1}^{N}p(y_{0}|\xi_{0}^{l})\right]\cdot\prod_{k=1}^{n}\left[{1\over N}\sum_{l=1}^{N}p(y_{k}|\xi_{k}^{l})\right]\to Z_{n} = P(y_0)\cdot\prod_{k=1}^{n}p(y_{k}|y_{1,\cdots,k-1})\label{Z_approx}\\ \psi^{N}=P(\vec{\xi}\_{0},\cdots,\vec{\xi}\_{n},\vec{i}\_{0},\cdots,\vec{i}\_{n}|y_0,\cdots,y_n) & = & \left[\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\prod_{l=1}^{N}\frac{p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\sum_{l=1}^{N}p(y_{0}|\xi_{0}^{l})}\right]\cdot\prod_{k=1}^{n}\left[\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\prod_{l=1}^{N}\frac{p(y_{k}|\xi_{k}^{i_{k}^{l}})}{\sum_{l=1}^{N}p(y_{k}|\xi_{k}^{l})}\right]\label{psi}\\ \end{eqnarray} There are two common particle smoothing methods for making inferences about hidden states $X_{t}$ from observations $y_{1:T}$. The genealogical tree algorithm and the forward filtering backward sampling algorithm. Genealogical tree algorithm consists of iteratively tracing the ancestral line of particle positions for $x_{T}^{i_{T}^{1}},\cdots,x_{T}^{i_{T}^{N}}$: \[ x_{1,T}=x_{1}^{i_{1}^{i_{2}^{\cdot^{\cdot^{\cdot i_{T}^{k}}}}}},\cdots,x_{T,T}=x_{T-1}^{i_{T-1}^{i_{T}^{k}}},x_{T}^{i_{T}^{k}}\mbox{ where }k=1,\cdots,N. \] The ancestral lines including the particle positions at time $T$ form a particle approximation of the posterior distribution of state trajectory \[ \hat{p}(X_{1,\cdots,T}|y_{1,\cdots,T})=\frac{1}{N}\sum_{k}\delta_{x_{1,T}^{k},\cdots,x_{T,T}^{k}}(X_{1,\cdots,T})\stackrel{N\to\infty}{\longrightarrow}p(X_{1,\cdots,T}|y_{1,\cdots,T}). \] Hence the particles $x_{t,T}^{1},\cdots,x_{t,T}^{N}$ form an approximation of the smoothing probability $\hat{p}(X_{t}|y_{1,\cdots,T})=\frac{1}{N}\sum_{k}\delta_{x_{t,T}^{k}}(X_{t})\stackrel{N\to\infty}{\longrightarrow}p(X_{t}|y_{1,\cdots,T})$. A well known problem of the genealogical tree algorithm is that all particles at time $T$ will quickly converge to the same ancestors in their ancestral lines and subsequently the genealogical tree provides a poor approximation to the smoothing distribution $P(X_{t}|y_{1,\cdots,T})$ due to insufficient sample size. Forward filtering backward sampling Backward consists of resampling particle indices backward in time $j_{n}^{k}\sim\mbox{Categorical}(P(y_{n}|x_{n}^{1}),\cdots,P(y_{n}|x_{n}^{N}))$, and $j_{t}^{k}\sim\mbox{Categorical}(P(x_{t+1}^{j_{t+1}}|x_{t}^{1})P(y_{t}|x_{t}^{1}),\cdots,P(x_{t+1}^{j_{t+1}}|x_{t}^{N})P(y_{t}|x_{t}^{N}))$ for $t=n-1,\cdots,0$. The resulting particle trajectories $x_{1}^{i_{1}^{j_{1}^{k}}},\cdots,x_{T}^{i_{T}^{j_{T}^{k}}}$ form an approximation of the posterior distribution of the state trajectory and particles $x_{t,T}^{1},\cdots,x_{t,T}^{N}$ form an approximation of the smoothing probability. The conditional probability of particle locations and indices and a particle trajectory from backward resampling is consequently \begin{equation} k^{N}=P(\vec{\xi}\_{0}:\vec{\xi}\_{n},\vec{i}\_{0}:\vec{i}\_{n},j_{0}:j_{n}|y_{1:n})=\psi^{N}(\vec{\xi}\_{0},\cdots,\vec{\xi}\_{n},\vec{i}\_{0},\cdots,\vec{i}\_{n})\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\cdot\prod_{k=0}^{n-1}\frac{P(\xi_{t+1}^{j_{n+1}}|\xi_{t}^{j_{n}})P(y_{t}|\xi_{t}^{j_{n}})}{\sum_{l=1}^{N}P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{l})P(y_{t}|\xi_{t}^{l})}.\label{k} \end{equation} The genealogical tree algorithm and the forward filtering backward sampling algorithm both generate biased approimation of the posterior distribution of state trajectory. To remove the bias, we can resort to the Metropolis-Hastings algorithm or the importance sampling algorithm. The [Metropolis–Hastings algorithm][1] is a Markov chain Monte Carlo algorithm to draw a sequence of random samples from a probability distribution $P(x)$ usually on a very high dimensional space, provided that $P(x)$ is known up to a normalization factor. The lax requirement that $P(x)$ is known up to a normalization factor makes the Metropolis–Hastings algorithm particularly useful, because the normalization factor is often extremely difficult to calculate. The M-H algorithm proceeds by randomly moving around the sample space, often staying in high-density regions of $P(x)$ and occasionally exploring low-density regions according to the acceptance ratio $\alpha$ that compares the target distribution $P(x)$ and a proposal distribution. The <a>importance sampling</a> algorithm is a variance reduction technique to integrate a function against a probability distribution $P$, where the non-zero region of the function has a very small probability. To address the problem that a sample of distribution $P$ has insufficient number of points in the important region, we sample from another distribution $q$ that overweights this important region, and weight the sample points according to how the target distribution $P$ compares with the proposal distribution $q$. In either Metropolis-Hastings algorithm and importance sampling algorithm, we need to compare the proposal distribution and the target distribution. We proceed to show that the probability for us to get a particle trajectory from either the genealogical tree algorithm or the FFBS algorithm times the empirical probability $Z^N_n$ (Eq. $\ref{Z_approx}$ equals the joint probability of the particle trajectory and the observations, thus is proportional to the posterior probability of the particle trajectory according to the original dynamics. Hence we can combine the particle trajectories from multiple particle smoothing runs with weights $Z^N_n$. To find the probability of an ancestral line $(\cdots,\xi_{n-1}^{i_{n-1}^{i_n^1}},\xi_n^{i_n^1})$ from the genealogical tree algorithm, we can integrate/sum over all other particle positions and indices $i_n^l$ for $l=\{2,\cdots,n\}$, $\xi_n^l$ for $l=\{1,\cdots,n\}\backslash\{i_n^l\}$, $i_{n-1}^l$ for $l=\{1,\cdots,n\}\backslash\{ i_n^1 \}$, $\xi_{n-1}^l$ for $l=\{1,\cdots,n\}\backslash\{ i_{n-1}^{i_n^1} \}$, …. \begin{eqnarray*} & & \sum_{\mbox{except }i_{n}^{1},i_{n-1}^{i_{n}^{1}},\cdots;\xi_{n}^{i_{n}^{1}},\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}},\cdots}\psi^{N}Z_{n}^{N}\cdot N^{n+1}\\ & = & \sum_{\mbox{except }i_{n}^{1},i_{n-1}^{i_{n}^{1}},\cdots;\xi_{n}^{i_{n}^{1}},\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}},\cdots}\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\frac{\prod_{l=1}^{N}p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{0}|\xi_{0}^{l})\right)^{N-1}}\prod_{k=1}^{n}\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\\ & = & \cdots\int d\xi_{n}^{1},\cdots,\xi_{n}^{i_{n}^{1}-1},\xi_{n}^{i_{n}^{1}+1},\cdots,\xi_{n}^{N}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})\sum_{i_{n}^{2},\cdots,i_{n}^{N}}\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\\ & = & \cdots\int d\xi_{n}^{1},\cdots,\xi_{n}^{i_{n}^{1}-1},\xi_{n}^{i_{n}^{1}+1},\cdots,\xi_{n}^{N}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})p(y_{n}|\xi_{n}^{i_{n}^{1}})\cancel{\prod_{l=2}^{N}\frac{\sum_{i_{n}^{l}=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})}}\\ & = & \cdots p(y_{n}|\xi_{n}^{i_{n}^{1}})p(\xi_{n}^{i_{n}^{1}}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})\cancel{\prod_{l\ne i_{n}^{1}}\int d\xi_{n}^{l}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})}\\ & = & p(y_{n}|\xi_{n}^{i_{n}^{1}})p(\xi_{n}^{i_{n}^{1}}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}})p(\xi_{n-1}^{i_{n-1}^{i_{n}^{1}}}|\xi_{n-2}^{i_{n-1}^{i_{n-1}^{i_{n}^{1}}}})\cdots \end{eqnarray*} To find the probability of a backward sample $\xi_0^{j_0},\cdots,\xi_n^{j_n}$, we marginalize over all other particle positions and indices $\xi_n^l$ for $l\ne j_n$, $i_n^l$ for $l=1,\cdots,N$, $x_{n-1}^l$ for $l\ne j_{n-1}$, $i_{n-1}^l$ for $l=1,\cdots,n$ …. \begin{align*} & \sum_{\text{except }j_{1:n},\xi_{1:n}^{j_{1:n}}}k^{N}Z_{n}^{N}\\ = & \sum\prod_{l=1}^{N}\rho_{0}(\xi_{0}^{l})\frac{\prod_{l=1}^{N}p(y_{0}|\xi_{0}^{i_{0}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{0}|\xi_{0}^{l})\right)^{N-1}}\prod_{k=1}^{n}\prod_{l=1}^{N}p(\xi_{k}^{l}|\xi_{k-1}^{i_{k-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{k}|\xi_{k}^{i_{k}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{k}|\xi_{k}^{l})\right)^{N-1}}\cdot\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\prod_{k=0}^{n-1}\frac{P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{j_{t}})P(y_{t}|\xi_{t}^{j_{t}})}{\sum_{l=1}^{N}P(\xi_{t+1}^{j_{t+1}}|\xi_{t}^{l})P(y_{t}|\xi_{t}^{l})}\\ = & \sum\cdots\sum_{i_{n}^{1:N},\xi_{n}^{\backslash j_{n}}}\prod_{l=1}^{N}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})\frac{\prod_{l=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}\\ = & \sum\cdots p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})\prod_{l\ne j_{n}}\bcancel{\sum_{\xi_{n}^{\backslash j_{n}}}p(\xi_{n}^{l}|\xi_{n-1}^{i_{n-1}^{l}})}\frac{\cancel{\prod_{l=1}^{N}\sum_{i_{n}^{l}=1}^{N}p(y_{n}|\xi_{n}^{i_{n}^{l}})}}{\cancel{\left(\sum\limits \_{l=1}^{N}p(y\_{n}|\xi_{n}^{l})\right)^{N-1}}}\frac{P(y_{n}|\xi_{n}^{j_{n}})}{\cancel{\sum_{l=1}^{N}(y_{n}|\xi_{n}^{l})}}\\ = & \sum\cdots\sum_{i_{n-1}^{1:N},\xi_{n-1}^{\backslash j_{n-1}}}p(\xi_{n-1}^{l}|\xi_{n-2}^{i_{n-2}^{l}})\frac{\prod_{l=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{l}})}{\left(\sum\limits \_{l=1}^{N}p(y\_{n-1}|\xi_{n-1}^{l})\right)^{N-1}}\frac{P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})}{\sum_{l=1}^{N}P(\xi_{n}^{j_{n}}|\xi_{n-1}^{l})P(y_{n-1}|\xi_{n-1}^{l})}\cdot p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})P(y_{n}|\xi_{n}^{j_{n}})\\ = & \sum\cdots p(\xi_{n-1}^{j_{n-1}}|\xi_{n-2}^{i_{n-2}^{j_{n-1}}})\bcancel{\sum_{l\ne j_{n-1}}p(\xi_{n-1}^{l}|\xi_{n-2}^{i_{n-2}^{l}})}\cancel{\sum_{i_{n-1}^{j_{n}}=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{j_{n}}})p(\xi_{n}^{j_{n}}|\xi_{n-1}^{i_{n-1}^{j_{n}}})}\frac{\cancel{\prod_{l\ne j_{n}}\sum_{i_{n-1}^{l}=1}^{N}p(y_{n-1}|\xi_{n-1}^{i_{n-1}^{l}})}}{\cancel{\left(\sum\limits \_{l=1}^{N}p(y\_{n-1}|\xi_{n-1}^{l})\right)^{N-1}}}\frac{P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})}{\cancel{\sum_{l=1}^{N}P(\xi_{n}^{j_{n}}|\xi_{n-1}^{l})P(y_{n-1}|\xi_{n-1}^{l})}}\cdot P(y_{n}|\xi_{n}^{j_{n}})\\ = & \sum\cdots\cdot p(\xi_{n-1}^{j_{n-1}}|\xi_{n-2}^{i_{n-2}^{j_{n-1}}})P(y_{n-1}|\xi_{n-1}^{j_{n-1}})P(\xi_{n}^{j_{n}}|\xi_{n-1}^{j_{n-1}})P(y_{n}|\xi_{n}^{j_{n}})\\ = & \rho(\xi_{0}^{j_{0}})P(y_{0}|\xi_{0}^{j_{0}})\prod_{t=1}^{n}P(\xi_{t}^{j_{t}}|\xi_{t-1}^{j_{t-1}})P(y_{t}|\xi_{t}^{j_{t}}). \end{align*} Hence the $l$-th particle trajectory from the $k$-th particle smoothing run ${^k}\xi{\_{1:n}^{j\_{1:n}^l}}$ has an importance weight ${^k}Z{^N_n}$. In other words, ${1\over K}\sum_{k=1}^K {1\over L_k}\sum_{l=1}^{L_k} \delta_{{^k}\xi{\_{1:n}^{j\_{1:n}^l}}}(X_{1:n})\cdot {^k}Z{^N_n}$ is an estimate of the posterior probability $P(X_{1:n}|y_{1:n})$. The Metropolis-Hastings algorithm to sample particle trajectories from multiple particle smoothing runs work in the following way. Basis: Set the current particle trajectories from particle smoothing ${^*}\xi_{1:n}^{j_{1:n}^l}={^1}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_1$ and keep ${^*}Z^N_n={^1}Z^N_n$. Induction: Run particle smoothing to get ${^k}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_k$ and keep ${^k}Z^N_n$. Accept the new trajectories with probability $\alpha=\min(1,{{^k}Z^N_n\over {^*}Z^N_n})$ and update the current trajectories ${^*}\xi_{1:n}^{j_{1:n}^l}={^k}\xi_{1:n}^{j_{1:n}^l}$ for $l=1:L_k$ and ${^*}Z^N_n={^k}Z^N_n$.

]]>The topic of this lecture is Introduction of Statistic Physics. Statistic Physics is a course.In Statistical Physics, one thing that people do is to explain the microscopic phenomena, microscopic interaction of the agents of the atoms and how this was accomplished and how we expect to accomplish those things in our study of our social systems. In Statistic Physics, what people normally do in Physics is we use the least amount of laws to explain the most amount of phenomenons that we encounter.

The beauty of this is that if we only have the least amount of laws, then our possibility of making errors will be minimized, and if we know that those laws have been tested or have been refuted by researchers of many generations, so this means that there is something about those laws that we can rely on. In terms of the thermodynamics, people previously didn’t know what was happening. It turned out to be something statistics.

Prior to the 20th Century, the ideas that people normally think that given us enough measure or enough tool, sufficient tool to observe everything, we’ll be able to explain everything through Newton’s three rules, but after the beginning of the 20th Century, people began to think of information and think of statistics. The prevailing idea including the idea of today is that we’re not able to observe everything about our system, so this means that we will have to talk about probability. What is the probability for us to observe certain things?

In terms of the thermodynamics, what people have observed they include the following, so we have Joe. Previous people didn’t know what is heat so some people say that heat is perhaps some kind of substance like atoms that move from one place to another place. This turned out to be not true because we can construct many counterexamples. For example, when we have ice melting into water, it seems like everything is the same and so where is those atoms or where did those atoms … from, right? This is one example.

Another example is that it seems that people found that people could generate infinite amount of heat, so this means that there’s no conservation law about the number of heat because we can generate the heat infinitely. There was this person named Joe. Joe constructed a way to turn physical energy into heat and he also constructed a way to turn electrical energy into heat. This means that those things could be interchanged. Another finding that was pretty important is this Carnot Circle four stages.

We could use those four stages to either translate some heat into mechanical energy or translate mechanical energy back into heat. Based on all of those things, what people found out is ultimately something that look like the following. We have the internal energy of the system and this energy is a functional volume of course because if we compress the system, if we expand the system, we change the configuration of the system and we change the physical composition. There’s another observation about thermodynamics physics, which is that a system always goes … Heat always goes from high temperatures.

If we have two systems together, heat goes from the subsystem with higher energy to the subsystem of lower energy, so this means that there is a flow kind of thing in terms of heat. Based on all of those things, people ultimately find things in this kind of fashion. Here we’re using first order approximation. This is in terms of temperature, temperature times the change of entropy and the pressure times the change of volume. What people think here, the mind frame is that … First, it is the first order approximation and second, we have two different types of variables.

One type of variable is called the intensive variables and another part of variable is called extensive variables. When we combine subsystems into one, the value of an extensive variable of the combined system equals the sum of values of the subsystems, the differences in intensive variables between subsystems drive the change in corresponding extensive variables. Extensive variables include things such as weight. If we put two systems together, then the weight of this combined system is the weight of those two subsystems. Another kind of variable is called the intensive variables. For example, pressure is intensive variable because if we put two subsystems of different pressures together, then the difference of the pressure will drive the movement of the molecules from the subsystem of a higher pressure to the subsystem of lower pressure. There are several things that are very frequently mentioned or very frequently used. For example:

$d U(S, V) = \left({\partial U\over\partial S}\right)_V d S + \left({\partial U\over\partial V}\right)_S d V$,

where temperature $T=\left({\partial U\over\partial S}\right)_V$and pressure $p=-\left({\partial U\over\partial V}\right)_S$ are intensive variables, entropy and volume are extensive variables.

There are several things that are very frequently mentioned or very frequently used.

First, which is the $\textbf{free energy}F = pV = U – T S$ is Legendre pair of internal energy $U$, and $T$ and $S$ are the Legendre conjugate variables.

We know that it’s pretty hard to turn the energy associated with entropy into mechanical energy that we can use. Part of energy associated with entropy is something that we cannot use, but the rest part of the energy is something that we can use. We call that part of energy “free energy.” If we think of the form of free energy, so what we find out is that it has multiple occasion of intensive variable and extensive variable. Then we have two things, one is free energy and one is this internal energy. If we recall our lecture on convex optimization and convex deal and variational method, what we find out is that the free energy is a Legendre-Fenchel pair of the internal energy.

Second,which is the $\textbf{Enthalpy}H(S,p) = U(S,V) + p V$ is Legendre pair of internal energy $U$, and $p$ and $V$ are the Legendre conjurage variables. Entropy which is the part that we cannot use, entropy is the Legendre-Fenchel pair of the internal energy and the Legendre conjugate variables are $P$ and $V$.

Similarly, $\textbf{Gibbs free energy}G(T,p) = U – T S + p V$ is Legendre pair of internal energy $U$, with $(T,p)$ and $(S,V)$ being Legendre conjugate variables.Gibbs free energy is the convex pair. We also it call Legendre pair convex pair. Gibbs free energy is the convex function pair of the internal energy with the conjugate variables being two-dimensional. We have on the one hand $T$ and $P$, both are “temperature” and “pressure.” Both are intensive variables and we also have $S$ and $V$, which is entropy and volume. Both are extensive variables.

Now the question is that, so it is not very surprising that here we have the convex optimization here in our variation method because the idea of convex optimization ultimately came from here. Similarly for example, we think of graphic models that we talked about previously in this class and the idea of graphic models is also from Statistic Physics. Here I just want to review on the interpretation of the Legendre Transform. We have two so we’re given a function $F$ and $F$ is a convex function.

We define the Legendre Transform :

$f: I\to \mathbf R$ is a convex function.

$f^*(x^*)=\sup_{x\in I} (x^*\cdot x-f(x))$is its Legendre transform.

so a Legendre transform has different variables. We define the Legendre Transform of this one at point $x^*$ is the supreme of when we multiply the intensive variable with the extensive variable. We multiply a variable with its pair and subtract function $F$ from the variable and take the supreme. We get the Legendre Transform at the conjugate variable. If we think of, in this form, we have on the $X$ axis we have variable $V$ and on the $Y$ axis we have variable $P$.

We can express variable $V$ in terms of $P$, and we can express variable $P$ in terms of variable $V$, and we have the area underneath this line here, and this is monotonically increasing function. The area under this line here, we express this as $L$. The area to the left-hand side of this line we express this as $P$. What we have is that $L$ is the integral from zero to $V$. Understand $L$ is the area underneath and $H$ is the area to the left. What we have is $L$ of $V$ plus $H$ of $P$ equals $P$ times $V$. If we think of it in another way, so what we have is $H$ of $P$ equals $2P$ times $V$ minus $L$ of $V$.

$L(v)=\int_0^v p(v’)dv’$ $\Rightarrow {\partial L\over\partial v}=p$

$H(p)=\int_0^p v(p’)dp’$ $\Rightarrow {\partial H\over\partial p}=v$

$H(p) = p\cdot v-L(v)$

$L(v)+H(p) = p\cdot v$

This is another interpretation of the Legendre Transform. What we want to notice is that we have two functions, $L$ and $H$, and $L$ and $H$ are Legendre pairs, and $L$ and $H$ take different variables. If we think of Legendre Transform of function that takes a different state of variables, so not necessarily the variables of the original functions. This is one interpretation of the Legendre Transform. Actually, they have names. $H$ is what people normally call energy, so Hamiltonian and $L$ is the Lagrangian so $L$ and $H$ have different names.

function $y=f(x)$ corresponds to

envolop $y=p\cdot x-f^*(p)$ where

$p=f'(x)$ is slope of $f(x)$ at $x$

$f^*(p)$ is intercept determined by $p$

$f^*(p)$ is convex in $p$

$f(x)=f^{**}(x)$ if $f(x)$ is convex

if we compare two. People first understand thermodynamics before people think of it in terms of Legendre Transform. We have the science of those things that look pretty weird but people first have the equation of the increment of energy and then people later find out that all of those things have the same form which is in terms of the Legendre pairs. If we take $T$ and $S$ as Legendre conjugate variables, we get free energy. If we take another set of Legendre conjugate variables we get another pair of Legendre functions.

This the main interpretation that we talked about before and in this one we have a function $Y$ here and we want to find out the Legendre Transform in this one, so what we do is we first give a envelope and then we move the envelope up until we first touch the line of this function and this function is a convex function, so this slope will touch the function at some time for the first time. At the time that this slope which is identified as $P$, at the first time that the slope touches the function, we use the intercept at the $Y$ axis as the function; that is the Legendre pair corresponding to the slope.

This is another interpretation of this idea. Those interpretations are actually the same. What we have is $F$, so the Legendre function at $P$ plus the original function at $X$. We know that $P$ and $X$ are Legendre pairs. If we multiply those things together, then the total length is this one which is $X$ times the slope, $P$ of $X.$This is another interpretation of this. As we mentioned, people have found many interesting things about the thermodynamics but the question people had was, how do we explain the thermodynamics using the least amount of laws, the least amount of assumptions to make an interpretation of this?

Here $E$ is a scale of variable which is the energy but $E$ could be other things. It could be a long feature, so what we know is that maximum entropy probability given as the set of features, the sufficient statistics $E,$ is going to take a exponential form. We have a set of parameters, the natural parameters corresponding to those features, beta and $E.$ Given $E$ we can find out beta and given beta we can find out $E.$ According to Boltzmann the probability of a system to be in energy, $E$ of $J,$ so $J$ is an index of the energy, so if we think of a system with a finite number of molecules, then the number of energies is finite and we can actually number those energy configurations.

According to Boltzmann it looks like the following form.

$P_j = \exp(-\beta E_j) / Z$, where – $P_j$ is the probability of observing the $j^\mbox{th}$ state with energy $E_j$

$Z=\sum_j \exp(-\beta E_j)$ is the **partition function**.

$Z$ here is called the partition function and it is the normalization constant to make it probability. The conventional way and the classical way to demonstrate that we have this distribution is the following. We consider a small cylinder and this cylinder is what we consider and we put this small cylinder in a heat reservoir. We have considered this small cylinder there and energy as $E$ of $J$ at this, it is in energy configuration $J$ and the total energy.

The energy here is energy of $R$ which is the total energy minus $E_j$ so the total energy here is $E$ and we assume that the total energy in this duration, we assume that the total energy of the whole system of reservoir and the small tube is constant so $E_R$ plus $E_J$ equals to $E$.

The following one is a very critical assumption which says that in a system all configurations are equally like. Of course we have constraint, which is all configurations should have the same energy, which is energy $E,$ right? This is very critical and this is actually what we have previously identified as maximum entropy. In a system, okay, so maximum entropy, all configurations are equally likely.

$\log P_j \propto \log\Omega(E-E_j) = \log\Omega(E)-\beta E_j+\cdots \Rightarrow P_j\propto \exp(-\beta E_j)$. Now what we do is to take the first order approximation so around the energy “E,” so we have log of “E.” This is the first energy turn here and beta naturally comes out from this as the partial derivative of the log partition function. I should say log number of configuration over energy and from here we can say that the probability for the system to be in energy “I” is proportional to some exponential form over there.

${\partial \log Z\over \partial \beta} = -\sum_j E_j P_j = -U$, where $U$ is internal energy

${\partial \log Z\over \partial V} = {1\over Z} \sum_j {\partial\over\partial V}\exp(-\beta E_j(V))= \beta \sum_j p_j P_j = \beta p$ where $p_j$ is the **pressure** associated with the $j^\mbox{th}$ state.

– $d\log Z= {\partial\log Z\over\partial\beta} d\beta + {\partial\log Z\over\partial V} d V=-U d\beta+\beta p d V = -d(\beta U) + \beta \underbrace{(d U + p d V)}_{T d S}$, or $T d S = {1\over\beta} d (\log Z+\beta U)$, or $S=k \log Z + {U\over T}$, where $k={1\over\beta T}$ is Boltzmann’s constant, or $U-T S=k T \log Z = F$ is free energy. – $S = k\log Z + k\beta U = K \log Z + k \sum_j \beta E_j P_j = -k \sum_j P_j \log P_j$ entropy relationship.

What are the properties of $F(x)$?

It is a function. First it is a function between zero and one, because if $F(x)$ is the probability, X less than or equal to X. The probability for this event to happen. If we talk about probability, a probability is something that is, according to that one equal to zero and smaller or equal to 1. That’s why this is always between zero and one. The second property is that, this is always now decreasing and the reason is that, by talking about probability, we talk about positive, a probability mass function, density function, those are positive. This means that we do not cope with negative probability or apart with negative probability.

It is always increasing. We also know that as X tends to minus infinity we have zero, because nothing hides as a minus infinity. If we go to, if we proceed to positive infinity we get 1 because assuming because nothing hides there. That is the probability, cumulative probability function.

What is Inverse Transform?

$F^{-1}(y) = \inf\{x|F(x)\ge y\}$ where $0\le y\le 1$ is the inverse of $F$

What we do in the method of inverse transform, is that, we first take the inverse of $F(x)$. After we take the inverse of $F(x)$, so it is something F minus one to Y and that is defined as the infimum or minimum, the smallest number Y. The infimum of let’s say, the infimum of X, such that, $F(x)$ is greater than or equal to Y.

It’s not a problem if we cope with a continuous function but, if we cope with for example, function that hits support. A finite number of, that takes support on a finite set, then we’ll have to differentiate between minimum or maximum. For example, this one what we have is, I think, whenever we have a mass, we have a junk here, it is something that is, let’s say, that is left continuous.

Here is the inverse transform of uniform random variable $U$ on $(0,1)$

$\bf P(x)=F^{-1}(U)$ follows the distribution of $F(x)$

proof:

$$

\begin{align}

&{\bf P}(\{F^{-1}(U)\le x\}) \\

= & {\bf P}(\{U\le F(x)\}) \\

= & F(x)

\end{align}

$$

We first saw, we know that the cumulative distribution function is from zero to 1 and what we do is we first sample on the Y axis. After we sample on the Y axis, we find out the corresponding value on the X axis and we take this X and this X has the cumulative distribution function, that is $F(x)$.

This is another example.

$F(x)=1-\exp(-\lambda x)$ is cumulative distribution function of a exponential random variable.That is how we get a sample from explanation distribution through this method of inverse transform.

$\bf P(x)=F^{-1}(U) = -{1\over\lambda}\log(1-U)=-{1\over\lambda}\log U_1$

Here, I’m sampling from a randomized a uniform distribution. I’m sampling from a uniform distribution and then I make the transform and then I compare the quantiles of those two functions and we find out that the quantiles house are equal. This means that, this just intuitively help us to verify that. This is the right way to sample a function where if no one cumulative distribution function from a uniform distribution.

Another way to sample from a distribution is, a method called acceptance, rejection. Suppose that we have a probability density function which is $g(x)$. The random variable $Y$ has this probability of density function. Suppose that whenever we sample this $g(x)$, we either accept, no matter whether we make a sample of $Y$, we either accept this $Y$, or we reject this I. The probability of accepting this $Y$ is $h(y)$. Here, what we want to show is that, the result of accepting the $Y$ is a random variable and it has probability density function which is $g(x)\times h(x) $. The reason is the following.

The cumulative distribution of function of $x$ is going to be the $P(X\leq x)$ . That one has probability density function which is $\int_{-\infty}^x g(x) dx$.Accepting Y could be, we could have different radar of acceptance corresponding to different values of this Y. We need to multiply this one by $H(x)$ and that is the probability but then we need to normalize this one.

We need to multiply our normalization constant and the normalization constant is going to be this one from $\int_{-\infty}^X h(x)g(x) dx $ We know that the sum is going to be $\int_{-\infty}^\infty h(x)g(x) dx$. That’s how we claim that the probability density function of $Y$ is $g(x)\times h(x) $.

In order to do that, we first sample from a uniform distribution, distributed or supported by the interval zero and one and then accept this sample point of uniform distribution according to this function. After we do this, we know that the result will be a beta distribution. Here, I just, you run the code and here I have plotted the real distribution of $B(4,3)$ here and then I have plotted the distribution that I have estimated from acceptance rejection method, which is the red line. I’ve also created another way of sampling a beta distribution which is through other statistic.

We can see that those three lines are pretty close and as the sample size turns to infinity the red line and the green line should approach this black line here. You can study the code afterwards.

There are also other methods of sampling distribution. Let’s suppose that we have cumulative distribution function which is $F$.

$F=\sum_1^k p_i F_i$ is cumulative distribution function, where $\sum_i p_i=1$

$F$ is a mixture of $K$ distributions and $F_1$, to $F_k$. We can decompose $F$ into a mixture of distributions, then we can sample this $F$ by first sample from the category. Which category will we sample?

After we sample from the category, like we know that we want to sample the Jth category then we sample $x$ from $FJ$. $X$ has cumulative distribution function which is $F$ because, if we find out the cumulative distribution function for $X$. Here we apply, we replace this, the probability $J$ is equals to $j$ with the probability for us to take a sample from the Jth component, we get $F(x)$, which is what we require.

– We first sample $J$ from categorical distribution $(p_1,\dots,p_k)$ then

sample $X$ from $F_J$

– $X$ has the distribution defined by $F$

$P(X\le x)=\sum_j P(X\le x|J=j)P(J=j)=\sum_j P_j(X\le x)p_j=F(X)$

Example of sampling through the method of composition is that, suppose that we want to make a sample that is pretty irregular, looks like the following.

$f(x) = 0.5+2\cdot (1-0.5)x$

From zero to 1 and it takes, then we have this one here and then we have this one here. Because we, this is the PDF and we require the area, is under this line to be one. We know that it’s PDF of a continuous random variable supported by the interval zero and one.

One way to sample this one is to find out if it is a composition of two distributions. In order to do that, we first make a sample of which distribution you want to sample. Either you want to sample from first distribution or you want to sample from second distribution. After we find out, we sample which distribution we want to sample proceed to either sample first distribution or we sample second distribution. The key to implement is actually pretty easy. We can study this code.

There are other ways to sample a random variable and this is according to the properties of a random variable.

Representing random variable $X$ in terms of other variables easily generated. For example:

– Sample from **binomial distribution** as the sum of i.i.d. Bernoulli trials

– Sample from **Poisson distribution** as the number of i.i.d. realizations of an exponential distribution before total time exceeds unit time

– Sample from **Erlang-$k$ distribution** as the total time of $k$ i.i.d. realizations of an exponential distribution

– Sample from **beta distribution** $B(k,n+1-k)$ as the $k^{\rm th}$ smallest number in $n$ numbers uniformed distributed in $[0,1]$

Let’s talk about sample from a Poisson distribution as the number of ideal realizations of exponential distribution before we have reached a certain amount of time.

The reason is that, a Poisson distribution is the distribution of the number of Poisson events. Meaning that if we put the intervals of exponential distributions, heads to tale together until we reach, until we get from one end of interval to other of the interval. We count the number of intervals here, inside this small intervals here, inside this interval and the number as a poisson distribution.

Similarly, this distribution that people normally use to model the origin of phone calls. We can sample our own distribution as a total time of ideal realizations of exponential distribution. This is an example that we sample Poisson distribution according to the property of Poisson distribution. Here X is a sample of Poisson distribution from exponential distributions and Y is a sample from the Poisson distribution directly. If we plot the quantile plot of those two, we find out that they are pretty similar and they’re the same type of distributions. They are also other methods to sample from a probability distribution which we’ll talk about later.

They are also other methods to sample from a probability distribution which we’ll talk about later. For example, we have, you might have already heard of it, so which is give sampler or we might use something that is called metropolis hastings sample which is very similar to this rejection acceptance sample that we talked about before.

In order to understand those two ways to sample from a random variable, we first need to understand on some concepts about stochastic process because those two ways of sampling from random variable probability distribution depends on asymptotic properties of a stochastic processes. Meaning that, we sample a process, this is a random process, again and again until we lose all memory of where we have started and at that time, we know that the distribution of the theme of some statistic that we sample observes the distribution that we are interested in.

We have talked about many different ways of sampling from probability distributions. Even more important thing is, if we want to work with probability, we first want to know which probability distribution should we choose. It’s a random variable of exponential distribution or is it Poisson distribution or is it a random variable of some other distributions. This involves statistics or testing. We have a hypothesis which is that, a random variable observed some distribution and we want to test whether it is more likely that this random variable has this distribution or less likely and reject the unlikely cases.

Here’s just an example of hypothesis testing. A hypothesis testing involves the following concepts.

– **belief/null hypothesis**: a coin is fail

– **experiment**: toss coin for 10 times, resulting a **sample** of 10 tosses

– **test statistic**: number of heads.

– **significance level**: probability of rejecting hypothesis/belief when it is true

– **cretical region/region of reject**: reject hypothesis when statistic falls into this region.

In order to conduct the hypothesis testing, we first need to have a belief about what we have here, without the belief we cannot do anything and we also call this belief a now hypothesis. Like, if we want to test whether a coin is a fail coin meaning that we get the same, approximately the same number of heads and tales or I say it another way, whether we get the same number of heads and tales if the number of samples sides turns to infinity.

Suppose that this is our hypothesis and after we have a hypothesis, we proceed to conduct experiment. An experiment involves that we toss these coins for ten times, like ten times, twenty times that’s not infinite amount of times. After this experiment we get a sample which is ten tosses of a coin. After this sample, like we can get head, tale, head, tale, something like that. After this test, after we get this sample, we want to conduct some, we want to come up with some test of statistic. One intuitive way of finding out whether it is more likely or less likely for us to come out with a sample is we find out the probability for us to get the sample but it will not work.

The reason is that, under the now hypothesis that we have a failed coin, let’s say, I think there’s a type. It’s not fail, a coin is a fail. It’s a type of coin that’s fail. One intuitive idea, we want to find out the probability for us to get a sample of ten tosses. The higher the probability, the more likely that the hypothesis is true. That will not work because suppose that we have a failed coin then the probability for us to get any sample of ten tosses is going to be ten to the power of, let’s say point five to the power of ten, so it will not work. There is another way, which is, which involves just statistics.

If we say our test of statistic is the total number of heads in the ten tosses, so this looks more promising. If a coin is a failed coin, then we’re likely to get five heads and five tales or four heads and six tales or six heads and four tales and it is very unlikely for us to get one head or two heads, or nine heads or ten heads.

The reason is that, by using the test of statistic which is the number of heads, we have a binomial distribution, something like that. We know that this area’s pretty likely and this area is pretty unlikely and this area is pretty unlikely. Now we find some promising test of statistic. The important thing for this test of statistic is that, we know the probability distribution of this test of statistic under the assumption of the now hypothesis, which is that, the distribution is binomial distribution.

We can proceed to reason about whether the hypothesis, the now hypothesis is more likely or less likely. As we said if it is in this area, and from this area and from this area, it is less likely. No matter what corresponding to each significance level, we always get some kind of critical region of rejection. If it is on this or on number of cases, in this region number of cases in this region, we reject our now hypothesis which is that the coin is a fail. Other wise we say that the hypothesis is compatible with our sample or our sample is compatible with the hypothesis. First the way that we choose significant level is arbitrary. We can choose what type of or significant levels we could choose for example 1.0, 1.001 and corresponding to different significance level we have different critical regions, a region of rejection.

In doing the statistical test, we always have the risk of either rejecting this hypothesis when it is correct or accepting this hypothesis when it is wrong. Like for example, when the hypothesis is correct and the coin has indeed failed, there is the possibility for us to get into this area and to get into this area always exists but we rejected those things. In addition, if the hypothesis that the coin is failed, is wrong, the coin is not failed, it is possible for us to get into this region and we get a number of answer which is in this region we have to accept the hypothesis as something that is compatible but it is wrong.

We’re coping with probability here and actually I just revealed test. Actual thing that I want to talk about is that, since we’re working with probability distribution, we always care about how good a sample fits probability distribution. We have a sample, we want to know whether the probability distribution is a good fit to the sample or a bad fit to the sample. This is a standard Gaussian distribution. I randomly sampled ten points from distribution. Another Gaussian distribution that means zeros and standard deviation, one which is, has the same distribution with this one. We can say that the sample represented by and the red dots looks pretty compatible with this distribution. We have made another sample which according to normal distribution which has mean, average of value one and standard deviation, point five which is represented by and the blue bus and we can say that the sample represented by the blue bus is a bad fit to the standard normal distribution here.

We can talk about different things. The first thing is we can talk about parametric. We can conduct parametric goodness of the test, which says that, suppose that we know that our sample is from a normal distribution, then whether or not we have got the right mean value of the distribution. Whether or not we have got the right variance or standard deviation in our distribution. This involves parameters, so we call it parametric test. Similarly to our example of testing whether a coin is a fail or not, one important thing is that, in order to check a test of statistic, we want to know the probability density function or the cumulative probability function of this test of statistic. Here for example, if the test of statistic is the average value, then we want to first find out the PDF or CDF of this test of statistic before we can set out a level of acceptance and find out the critical region or the region of rejection.

Here, if we’re talking about the, for example if we know that our sample is from a Gaussian distribution, and if we know the variance of this Gaussian distribution, then our estimator which is … $X_1$ to XI are random variables. Our estimator is far as some function of those random variables thus it is also a random variable. Whenever we talk about random variable, we can talk about the distribution, right, the probability of this random variable.

Let’s say, since we know that those X are taken from a Gaussian variable, and there’s a property of Gaussian random variable, which is a linear combination of some of those Gaussian random variables is also Gaussian random variable. We know that this one is a Gaussian random variable and the minimum of this Gaussian random variable of X [ ] of E. E is going to be E1 over N. We have this one so we first take this N out and then we take this submission sign out of this E so we get to this one. Name of SI and since we know that all of those are example of the same distribution, so we know that this is going to be times $N$ times nil, so that is going to be nil.

That is the average value of this $X$. Similarly we can find out the variance of this $X$. The variance of this x but here, yeah. Similarly we can find out the variance of this $X$ and the variance of this $X$ is, let’s say, it’s going to be more complicated but I think it is going to be this one. This means that $X$ bar is going to be a normal distribution with … I just write it this way. $X$ is going to be a normal distribution with this one and this one. We can actually $X$ bar. We can find out, we can actually plot the PDF of this $X$ bar. For example, if $X$ bar, it’s a random variable it can’t be anywhere. If it’s here then it’s more likely. Yes

Similarly, so suppose that we do not know the standard deviation of this Gaussian random variable from which we get our sample. Then one idea is to normalize the deviation of this X from by dividing it with the standard deviation. As a result we no longer get Gaussian random variable but we get key statistic that observes another probability the density distribution. This is parametric goodness of a fit test which supposes that, which assumes that we know the underlying probability distribution but we need to fix some parameters and want to check whether the parameters are compatible with the sample. Here we can also talk about number metric goodness of fit. Meaning that we want to fit a sample, we want to see whether a sample is compatible with a distribution. There are two ways to do this. One is what we call Kolmogorov test which is based on …

If we plot the probability let’s say the cumulative probability function, cumulative distribution function of random variable X and a cumulative distribution function of random variable Y, it could be something like this. If they are the same distribution then we could get a straight line here. The deviation of this plot of the sigma of X against sigma of Y gives us some signal about, some indication about to what extent those two distributions of different distributions. The Kolmogorov test, the statistic that is taken back, the Kolmogorov test is actually the supreme, the maximum deviation from the curve. That is, sigma for $X$ against sigma for $Y$ to this line, $Y$ equals to $X$. The interesting thing is that, if the sample size turns to infinity, we actually know the underlying perform of this Kolmogorov distribution.

This is how we test whether two distributions, two random variable observe the same distributions to samples of the same distribution from using Kolmogorov test. Another test that we talk about is this so called the goodness of fit test and that test uses a statistic which is a chi square statistic and which observe a chi square distribution. The chi square distribution is a distribution that generates from the sum of the square of standard normal distribution. We have $X_1$ to $X_N$ which standard normal distribution meaning that there was a thousand distribution with mean zero and standard deviation 1. If we sum the square of all of those distributions together, we get a chi square distribution with the grey of freedom which is $N$. The reason that we can use chi square test to test goodness of fit of many distributions is that, first we can use chi square test to test whether a distribution fits a binomial distribution. Then we can, if we can use it to test goodness of fit of a sample to a binomial distribution then we can extend the result to multi-normal distribution.

When we work with continuous distribution, what we do is to cut the real line into intervals, into beams. Ultimately we end up with working with either binomial or multi-nomial distributions. That is the goodness of fit chi square test. Also, when we work with different distributions, we need to estimate the parameters. We first talk about testing whether the parameters are right parameters and whether the distributions are right distributions. Now we talk about how, that there are many different ways to estimate the parameters.

One way that has been used quite often is the maximum likelihood estimated. In order to estimate, make the maximum likelihood estimation with the parameters, we first write off the likelihood which is the probability of the sample condition on the parameters and then we maximize the likelihood. In order to maximize the likelihood, we normally take the derivative of this likelihood over the parameters supposing that the likelihood is a continuous function of the parameters. Then we set this derivative to be zero and then we can find out the parameters. What we know is the maximum likelihood distribution is a asymptotically normal, meaning that, when the sample size goes to infinity, we have a normal distribution. It is efficient estimator in the sense that first it is, meaning that the variance of this estimation is a random variable as we do that before. We want to find estimation that has smaller variance than other distributions.

What we know is that, the maximum likelihood estimator which is random variable has the minimal of variance over many other similar estimators. Those I want to talk about in this lecture and in the next lecture we’ll proceed to talk about different stochastic processes as well as how we sample from those stochastic processes.

]]>

Let us tossing a coin repeatedly. Whenever we toss a coin, we conduct a **Bernoulli trial**, so we have, as a result, we could have a head up or we could have a tail up. We assign probability $p$ to the event that we get a head up. We assign probability $1-p$ to the events that we get a tail up. We call a head up as a success. We call a tail up as a fail. If we toss this coin for many, many times then we get interesting random variables and distributions. Formally speaking, let us use random variable $X$ to denote the result of a Bernoulli trial. $X$ can take two values: $X$ takes 1 with probability $p$ ($P(X=1)=p$) and $X$ takes 0 with probability $1-p$ ($P(X=0)=1-0)$.

What is the mean value and variance of this Bernoulli trial?

The expected value of $X$ is $\mbox E(X)=1\times p+0\times(1-p)=p$ according to the definition of expected value. This agrees with out intuition: If we bet on the result of a coin toss — if we win \$1 for a head and and nothing for a tail and if the probability of getting a head is $p$, we expect to get \$ $n\cdot p$ after $n$ trials, and \$ $p$ after 1 trial.

The variance of $X$ according to definition is the expected value of the square deviation of the outcomes to their mean value, $$\mbox{Var}(X)=\mbox E\left(X-\mbox E(X)\right)^2=(1-p)^2\cdot p+(0-0)^2\cdot(1-0)=p(1-p)$$.

Interesting random variables and distributions will arise from repeated coin tosses. If we throw this coin for $n$ times, then the number of heads $X$ after $n$ tosses has a **binomial distribution**. The number of fails until the first success and including the first success has **geometric distribution**. The number of successes before a specific number of fails has a **negative binomial distribution**. We will explain why it is called negative binomial distribution. The time until the first success, if the rate of success is $\lambda$, and the probability of success in any infinitesimal time $\Delta t$ is $\lambda\times\Delta t$, has **exponential distribution**. Geometric distribution is a discretized version of exponential distribution. Exponential distribution is the continuous version of geometric distribution. The time to the $k^\mbox{th}$ success has a **Gamma distribution**. Negative binomial distribution is the discrete version of gamma distributions, and gamma distribution is the continuous version of negative binomial distribution. The time to the $k^\mbox{th}$ success out of the $n$ successes happening in a fixed time has **Beta distribution**. If we throw a dice instead of a coin, the numbers of times we get faces 1-6 out of $n$ throws has **multi-nominal distribution**. As we conduct more and more experiments, the average will have first moment, and second moment, with higher moments all vanish. It has **normal distribution**.

Let us find the probability of getting $k$ heads from $n$ tosses, which is the probability for $X=\sum_{i=1}^n X_i$ to be $k$, where $X_1, X_2, \cdots X_n$ are the outcomes from a single toss. $X$ has binomial distribution with rate $p$ and number of trials $n$.

What is the probability that we get 2 heads out of 5 toss? Let us use $1_1$ and $0_1$ to denote a success and a failure respectively from the first toss — $X_1=1$ if we get $1_1$ and $X_1=0$ if we get $0_1$. A sequence $1_1, 1_2, 0_3, 0_4, 0_5$ of tosses has 2 successes, happing with probability $p\cdot p\cdot (1-p)\cdot (1-p)\cdot (1-p)$, where $p$ is the probability of getting $1$ and $1-p$ is the probability of getting $0$. Any permutation of the sequence $1_1, 1_2, 0_3, 0_4, 0_5$ has 2 successes, and there are $5!$ of them. The $5!$ permutations fall into ${5\choose 2}={5!\over 2!\cdot(5-2)!}$ bins, with each bin containing the $2!(5-2)!$ sequences that are the same with the subscript is dropped. For example sequence $1_1, 1_2, 0_3, 0_4, 0_5$ and $1_2, 1_1, 0_3, 0_4, 0_5$ are in the same bin and they are the same sequence after their subscripts are dropped. Hence the probability of getting 2 successes out of 5 tosses is ${5\choose 2}p^2(1-p)^{5-2}$. In general the probability of getting $k$ successes out of $n$ tosses is ${n\choose k}p^k(1-p)^{n-k}$, where $0\le k\le n$. Thus a binomial random variables is from counting the number of successes in repeated Bernoulli trials.

What is the expected value of a binomial random variable $X$?

$$\begin{eqnarray*}

\mbox{E}X &=& \mbox{E}\sum_{i=1}^n X_i = \sum_{i=1}^n \mbox{E} X_i = n\cdot p.

\end{eqnarray*}$$

What is the variance of the binomial random variable? There is a easy way, and there is a hard way. The harder way is through definition.

$$\begin{eqnarray*}

\mbox{Var}(\sum_{i=1}^{n}X_{i}) & = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}-\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\

& = & \mbox{E}\left(\left(\sum_{i=1}^{n}X_{i}\right)^{2}-2\left(\sum_{i=1}^{n}X_{i}\right)\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)+\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\right)\\

& = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)^{2}-2\mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)+\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\

& = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)^{2}-\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\

& = & \sum_{i=1}^{n}\mbox{E}X_{i}^{2}+\sum_{i,j=1}^{n}\mbox{E}\left(X_{i}X_{j}\right)-\left(n\cdot p\right)^{2}\\

& = & n\cdot\left(p(1-p)+p^{2}\right)+n^{2}\cdot p^{2}-\left(n\cdot p\right)^{2}\\

& = & n\cdot p(1-p)

\end{eqnarray*}$$

The easier way is to note that the Bernoulli trials $X_1,\cdots,X_n$ are i.i.d. or independent, and identically distributed and that the variance of the sum of independent random variables is the sum of the variances of those variables: $$\mbox{Var}(\sum_i^n X_i)=\sum_i^n\mbox{Var}(X_i)=n\cdot p\cdot(1-p)$$.

Let $X_1$ be a binomial random variable that counts the number of successes in $n_1$ independent Bernoulli trials, with the probability of success in each trials being $p$. Let$X_1$ be a binomial random variable that counts the number of successes in $n_2$ independent Bernoulli trials, with the probability of success in each trials also being $p$. It follows that $X_1+X_2$ is a binomial distribution that counts the number of successes in $n_1+n_2$ independent Bernoulli trials, with the probability of success in each trials being $p$.

If we throw a dice instead of a coin, the outcome is a **categorical distribution** over the face numbers $\{1,2,3,4,5,6\}$. The probabilities for us to get each of the six numbers sum up to 1. In general, we can define a categorical distribution over all $K$ possible outcomes.

We can throw a dice repeatedly, and the numbers of occurrences of all 6 numbers is a **multinomial distribution**. What is the probability that out of $n$ independent throws, we get $1$ for $n_1$ times, get $2$ for $n_2$ times, get $3$ for $n_3$ times, and so on, supposing that in a single throw we get 1 with probability $p_1$, get $2$ with probability $p_2$, get $3$ with probability $p_3$ and so on? The probability of getting a single sequence is $$P(x_1, x_2,\cdots,x_n)=p_{x_1}\cdot p_{x_2}\cdots p_{x_n}=p_1^{\sum_{i=1}^n \delta(x_i=1)}\cdots p_K^{\sum_{i=1}^n \delta(x_i=K)}=p_1^{n_1}\cdots p_K^{n_K}$$. The probability of any of the $n!$ permutations of this sequence will be the same. The permutations of $n_k$ tosses of outcome $k$ for $k=1,\cdots,K$ will give us the same sequence. Hence the probability of getting $n_1$ $1$, $n_2$ $2$, $\cdots$, $n_K$ $K$ is $P(n_1,\cdots,n_K)={n! \over n_1!\cdots n_K!} p_1^{n_1}\cdots p_K^{n_K}$.

If we throw the dice twice, the outcome of the first throw is independent of the outcome of the second throw. But if we throw this dice for just once, the outcomes are correlated — if we get a 1, we cannot get a 2 out of a single throw.

The number of failed trials until we get the first success has geometric distribution. What is the probability of getting $k-1$ fails until the first success? The **probability mass function** is $P(X=k)=(1-p)^{k-1}p$ — We first get $k-1$ fails with probability $1-p$ each time, then we get a success with probability $p$. What is the probability for $X$ to be less than or equal to $k$ (for $X$ to be $1,2,\cdots,k$)? The **cumulative probability function** is $P(X\le k) = 1-P(X>k) = 1-(1-p)^k$ — we get failed trials for the first $k$ times. The mean value and variance of a geometric random variable can be found through elementary math.

A random variable with geometric distribution has **memory-less property**, $P(X>s+t)=P(X>s)P(X>t)$, the probability of getting fails in the first $s+t$ trials equals the probability of getting fails in the first $s$ trials and the probability of failing in the next $t$ trails, and the probability of failing in the next $t$ trials equals the probability of failing in the first $t$ trials (by the time homogeneous of our coin). The geometric distribution is also the distribution of **maximum entropy** over distributions with support on non-negative numbers with mean value $\mbox{E}(X)$.

Suppose the arrival of the school bus has geometric distribution, then no matter what time we arrive at the bus station, we expect to wait the same amount of time to get the next bus, which is the average waiting time. If it is the case, we can use geometric distribution to model the arrival time of school bus. Otherwise, a geometric distribution is not a pretty good distribution for this type of model.

A negative binomial distribution is the distribution of the number of successes before we get $r$ fails, including the $r^\mbox{th}$ fail. A geometric distribution is a special case of negative binomial distribution. It is the distribution of the time until the first success, if we just reverse success and fail.

What is the probability mass function of the negative binomial distribution? What is the probability that we have $K$ successes until we get $R$ fails? We have conducted a series of experiments, and in a series of experiences, we have $K$ successes, and we have $R$ fails, so we have a specific sequence with $K$ successes and $R$ fails, where the probability is $P$ to the power of $K$, and $1-P$ to the power of $R$, because we have $K$ successes, and $R$ fails. The number of sequences that we have for us to have K successes, and R fails, and that the last one is always a fail, so we have already worked this out for many times.${\bf P}(X=k) = {k+r-1\choose k}p^{k}\cdot \left(1-p\right)^r$

That is how we get the probability mass function for negative binomial distribution. The expected value of negative binomial distribution is $pr\over 1-p$. This is intuitively.

Why it is called a negative binomial distribution? The reason is that it is related to the binomial distribution in a foreign sense. The probability for the number of successes before R fails. This is a random variable, because out of different samples, we will get different number of successes before R fails. The probability for this random variable to be less than $S$ is the same as the probability that we have more than $R$ successes after $R+S$ trials. Our successes after $R+S$ trials is another random variable, which is binomial random variable. What we are saying here, is that the probability for this negative binomial random variable to be less than $S$, is going to probability for a binomial random variable, which is the result of $R+S$ trials, to be greater than $R$. That is the relationship between the negative binomial distribution, and the binomial distribution, and it is the reason we call this negative binomial distribution.

$P(\mbox{number of success before r failure}\le s) = P(\mbox{more than r successes after r+s trails})$

The next distribution is the exponential distribution. The exponential distribution is the limit of geometric distribution. Previous, in geometric distribution, we conduct one experiment in 1 unit time. Now here we see that we can conduct a fractal experiment, in a sense. We can conduct experiment in $\Delta t$ time, except that the probability of success in this $\Delta t$ is $\lambda\cdot\Delta t$, so we have a constant rate. The smaller the interval, the smaller the probability of success.

This is how we get the probability for $X$ to be greater than $t$ is going to be this one.

$P(X\ge t) = \lim_{\Delta t\to 0}(1-\Delta t\cdot\lambda)^{{t\over\Delta t}-1}\cdot (\Delta t\cdot\lambda) / \Delta t = \lambda \exp(-\lambda t)$

If we take the derivative, then we will get the probability density function for exponential distribution. This is how exponential distribution arise from repeatedly throw coins. Similarly, we can find out the expected value of the exponential distribution, and the variation of exponential distribution. Since we have already said that the exponential distribution is continuous to its counterpart of a geometric distribution, then the same memory layers property of a geometric distribution can be carried over to a random variable of exponential distribution. That is how we get the memory-less property here.

Another very interesting property about exponential distribution is, suppose we have two exponential distributions. $Y1$, and $Y2$. Then, the minimum of $Y1$ and $Y2$ is also exponential distribution. We call it memory-less property. This is a distribution that has the maximum entropy, in a sense that each time we start afresh, and the past really does not matter. Each time, we have the same rate of event happening. That is why exponential distribution occurs a lot in engineering, and in other branches of science.

$\textbf{memoryless property}$: ${\bf P}(X>s+t) = {\bf P}(X>s){\bf P}(X>t)$

Exponential distribution is the $\textbf{only}$ continuous distribution with memoryless property

$\min(Y_1,Y_2)$ has exponential distribution with rate $\lambda_1+\lambda_2$ if $Y_1$ and $Y_2$ have exponential distribution with rate $\lambda_1$ and $\lambda_2$ respectively

The next distribution is the poisson distribution. The poisson distribution is the number of success in unit time. We talk about the time until the first fail, the time until the first success, the time until $R$ success, and so on. Here, we’re looking at a fixed interval. Then we’re looking at the number of successes in this interval. So, what is the distribution of this one? It also arises from taking the limit from binomial distribution. We take a unit time, and we divide this unit time in equal parts. The rate of success is $\lambda$. The probability of success of each equal part, each unequal part is going to be $\lambda\over n$.

If we take N to the limit, we will get this one.

$\lim_{n\to\infty}{n\choose k}({\lambda\over n})^k(1-{\lambda\over n})^{n-k}=e^{-\lambda}{\lambda^k\over k!}$

E to the power of minus lambda, which is a scaling factor to make this to be 1. We also have this one. The reason is that, this one, if we take N to the limit, it is some kind of exponential term. The reason is that this one … 1 plus X to the power of 1 over X, as X, 10 to zero, is going to be E. We can change the size to other things. We get little variation from this end result of E. That is how we get this one. I think the main tool for us to get this distribution from, by taking the limit as this one.

We can similarly find out the expected value of this Poisson distribution.

${\bf E}X = \lambda$

There’s a square here, and the variance of this Poisson distribution. A good property of Poisson distribution is that, if we have $Y1$ and $Y2$ both to be Poisson distribution with rates $\lambda1$ and $\lambda2$, then the sum of those two Poisson distributions is going to be another Poisson distribution, with the rates $\lambda1 + \lambda2$.

We also have splitting property, which is that if $X$ has a Poisson distribution with rates $\lambda$. We sample from a Poisson distribution, and we conduct Bernoulli trials over this unit time, corresponding to each success. We name these as success with probability of $P$, and we name these as a failure as probability of $1-P$. The result of the name success is still a Poisson distribution, but the rate is thinner, which is $P\cdot \lambda$. This is a thinning property of Poisson distribution.

The next distribution is gamma distribution, which arises often as a distribution of exponential family. The gamma distribution is the time until the $k^{\rm th}$ success. We talk about the negative binomial distribution, right? The gamma distribution is continuous counterpart of negative binomial distribution. Because gamma distribution is a continuous random variable, so we have a probability density function. The way to find out the probability density function of a gamma distribution is by considering the two distributions.

First, let’s specify that $X$ is the time to the $k^{\rm th}$ event in a Poisson process, meaning that if we sample the time to the next event of success, then we sample the time to the next success, and so on, so forth. Let’s use $Y$ to count the number of events in the interval from zero to $x$. The probability for $X$ to be less than $x$ is a cumulative probability function of a gamma distribution. The time to the $k^{\rm th}$ success is less than $x$, meaning that, in the interval from zero to $x$, there will be greater than or equal to $K$ number of events. If we talk about greater or equal to $K$ number of events, we’re talking about a Poisson distribution, and we say it is a random variable of Poisson distribution to take $K+1$, and so on, and so forth.

If we write this out, we get this one.

$P(X\le x)=1- P(Y<k)=1 – \sum_{i=0}^{k-1}{(\lambda x)^k\over k!}\exp(-\lambda x)$

That has the probability for this gamma distribution to be less than $x$.Because we are talking about continuous random variable, we can take the derivative, and find out the probability density function. This is how we get the probability density function of a gamma distribution.

${d\over dx}P(X\le x)={\lambda\over (k-1)!}(\lambda x)^{k-1}\exp(-{\lambda x})$

All of those distributions, if you take to the limit, in a sense, is going to be normal distribution. The first moment is the expected value of random variable X, and the second moment is the expected value of X squared. The higher order moment is the expected value of X cubed, and so on, and so forth. What we know is, if we take the random variable to some sequence of random variable to some limit, then the higher order moments will all disappear, and we’ll end up with only having first moment, and the second moment, and that is how we get the normal distribution.

$f(x) = {1\over \sqrt{2\pi}\sigma}\exp(-{1\over 2}({x-\mu\over\sigma})^2)$

– ${\bf E}X =\mu$

– $\sigma^2 X = \sigma^2$

Central Limit Theorem

– Let $X_1,X_2,\dots$ be i.i.d. with $\bf EX_1=0$ and $\bf EX_1^2=\sigma^2<\infty$

– Then $\frac{S_n}{\sqrt n}=\frac{\sum_{i\le n}X_i}{\sqrt n}\to\cal N(0,\sigma^2)$

Proof: moments higher than 2 converges to 0

– $\phi(t)=\bf E e^{itX}=\phi(0)+\phi'(0)t+\frac{1}{2}\phi”(0)t^2+o(t^2)\to 1-\frac{\sigma^2 t^2}{2}+o(t^2)$ as $t\to 0$

– $\bf Ee^{\frac{itS_n}{\sqrt t}}=(\phi(\frac{t}{\sqrt n}))^n\to e^{-\frac{1}{2}\sigma^2t^2}$ as $n\to\infty$

]]>

The Engineer had painted two out of four walls and as expected he had emptied the bucket of paint, followed by the Physicist’s turn to paint the walls. The Physicist never started to paint the walls as he was trying to figure out the calculation of how best to use the paint bucket so as to paint all 4 walls of the room. However, the Mathematician had painted all 4 walls of the room and surprisingly also had the bucket of paint intact. Now let us reason out the approaches each of them had taken in painting the walls.

The Engineer never really thought about if the given quantity of paint would be enough to paint all 4 walls of the room. So he went about painting the walls until the paint was exhausted. As we saw earlier, the physicist was taking time to figure out the calculations before he actually started painting the walls. The reason being, a physician would always work towards computing the laws of the system and hence a lot of thought is required before solving the problem. On the other hand, the mathematician just painted the rational numbers on all 4 walls (countable infinite). This example shows that different people deal with problems in different ways.

The dataset that tracks the location of the people is useful in many aspects. One of the papers is to model spreading of epidemics like Malaria through human movement. This study focuses on human movement from poor employment regions to rich employment regions and vice versa, as people tend to reside at places that are less expensive i.e., poor employment regions. So, the stochastic modeling of population mobility will help us identifying the cause of Malaria diffusion in social network.

Similarly, another paper deals with “Poverty Analysis in Senegal” wherein the information flow in the network will play a significant role in determining the poverty of Senegal. The different areas in Senegal are represented as nodes of a virtual network. If a particular area is poor, then that node in the network would be least visited. So, we can eradicate poverty and also identify areas that are poor by implementing better models of information flow in the network. The researcher has used the Google page rank approach in ranking the poverties of different cities in Senegal.

Consequently, the price variation of Millet was captured in Senegal using a satellite map of production. The researcher again focused on information flow and deduced that the Millet would be fairly priced in all regions of the country only if information flow is handled well.

The poverty is a multidimensional entity wherein poverty can be measured in terms of wealth, education, healthcare and so on. So, we need to count in a number of factors to determine poverty. To find answers to these factors, a census must be conducted across all the places at Senegal. This involves a large number of social resources like people, time, travel, cost etc. Even then, it is very hard to cover the entire population or vast population of the country by conducting surveys. In order to overcome the shortcomings of surveys, we can use mobile phone dataset as a suitable replacement. The reason being, in this present generation almost 96% of the world uses a mobile phone and the connectivity of mobile phones have reached to such an extent that even people residing in remote villages could be reached. This aids in collecting survey results at a very fine level. The important factors in mobile phone datasets are location/mobility traces of people, interaction of cities/communities amongst others.

Another paper that details the “Survey results on mobile phone datasets” focuses on the fact that inferences drawn must be generalized to larger extent of society and less use of social resources. The researchers gather the Call Data Records (CDR’s) – How, when and with whom communication happened that includes message and call data.

Based on these CDR’s, a social network was constructed with nodes, as people and a link exist between two people if they have reciprocity in communication between them. Here, we can either implement a directional link to indicate source as the caller and destination as callee or associate a weight to the unidirectional link indicating the number of times communication happened between two entities.

The researchers used the mobility traces of people to determine the frequent locations visited by them. Also, the focus was on information diffusion in the network to represent problems such as spreading of epidemics, spreading of viruses through Bluetooth or multimedia messages, viral marketing amongst others. The behavior of a community was studied based on the digital signature of a city or a group of people. It was noticed that the people belong to the same community tend to communicate amongst themselves rather than communicating with people outside their community. This also brings in the influence of ethnicity in the population behavior.

The Gravity law states that the probability that two people are connected decreases with increase in spatial distance separating them. Secondly, the duration of call increases with increase in separating spatial distance. Both these points are valid only until a threshold value is reached in distance separating them. After which, the values become constant or very less changes are displayed.

Another interesting observation made in the above paper was duration for people to be in phone contact. As per the researchers, on an average people relocate every 7 years and when relocated they get to meet new people and henceforth-new contacts will be made. In order to maintain the contacts in the phone, some of the old unused contacts would be deleted.

Now we see how people benefit from contacting the weak ties for employment opportunities. The strong ties (people with whom we communicate on a daily basis) know pretty much the same information/contacts and hence not much benefit would be seen in terms of new job opportunities. However, when weak ties are contacted the network grows beyond the scope of know information and chances are high that relatively more job opportunities can be explored.

The paper also talks about the usage of social networks in disasters. If an emergency situation occurs (bomb blast/plane crash/earthquake), it is noticed that the eyewitnesses are contacted immediately after these events. Also, by analyzing the social amplifiers (nodes with highest degree) and its immediate neighbors it is possible to detect the emergencies. However, work has to be done in predicting the emergencies by using this social network.

By studying the mobility traces of people, it helps in determining the economy of the country. The more mobile the better economy and few larger airtime purchases indicate better economy. Consequently, the mobility of people helps in Urban Planning (improving the roads in developing countries) and also in better planning of transportation network to ease traffic congestion.

In spite of all the above benefits from mobile phone dataset, the privacy issues are often challenging. It is important for the service providers to not reveal the personal information of users like phoning number and age. Hence, suitable anonymization techniques must be employed to preserve the identity of the user.

Dwelling into the social dynamics aspect, the models such as flocking model, language model, culture model are used where in the impact of flock movement is evaluated on real-time problems. For example, the herd movement of people impacting the stock prices in stock market. Secondly, the researchers talk about interaction of particle systems to check if a generalizable claim can be made on the observed pattern across all datasets.

Netlogo is another language used primarily for simulation, used for generating game theory model in the formation of network and to facilitate information exchange so as to maintain fairness in the social network.

Urban dynamics is one interesting topic in social dynamics field. There are several good textbooks on this topic such as “Cities and the Complexity” and Urban Dynamics. Researchers used agent-based model to simulate the movement of people, resources, populations and so on and got some interesting conclusions. There are also some soft wares for simulating urban dynamics. One example is UrbanSim, where the urban dynamic is simulated at the agent and individual level. SimCity, which is a famous game, is an interactive simulation of urban dynamics.

Stochastic process can be also used to model the dynamics in system biology and chemical reaction. By discretizing the time, we can make inference about the concentrations of different chemicals we cannot observe. One interesting thing is the similar methodology here can be used to study the asymptotic behavior of our social systems because there are simple corresponding relationships between people of different and different chemicals.

Financial Modeling is also a field full of stochastic process. One famous process is Brownian process. Brownian process is continuous but does not have derivative. This means that the increment is always small but when the time interval goes to zero, the change will be dramatic. An advanced process named Levy process, which is also widely used in this field, is a combination of Brownian motion and jump process. In contrast with the previous dynamic model, there are currently no exact inference algorithms for these processes. The common tool is simulation.

The most used tools in this course for inference in dynamic modeling is variational inference method, which has strong relationship with the law of large numbers and the large deviation theory.

]]>We work backward here.

\begin{align*}

& P(x_{k}|y_{1,\dots,T})\\

= & \sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,T})P(x_{k+1}|y_{1,\dots,T})\\

= & \overset{P(x_{k}|x_{k+1}y_{1,\dots,T})=\frac{P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}{\sum_{x_{k}}P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}=P(x_{k}|x_{k+1}y_{1,\dots,k})}{\sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,k})P(x_{k+1}|y_{1,\dots,T})}\\

= & \sum_{x_{k+1}}\frac{P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}{\sum_{x_{k}}P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}\cdot P(x_{k+1}|y_{1,\dots,T})

\end{align*}

Hence to estimate the distribution of $x_{k}$ from $y_{1,\dots,T}$ in Kalman smoothing, we first need to find the joint distribution $P(x_{k+1}|x_{k})P(x_{k}|y_{1,\dots,k})$ just after the prediction step to estimate the distribution of $x_{k+1}$ in Kalman filtering, condition the distribution of $x_{k}$ from predicted value of $x_{k+1}$, join with the distribution of $x_{k+1}$ in Kalman smoothing, and marginalize over the distribution of $x_{k+1}$ in Kalman smoothing:

\begin{align*}

\left(\begin{matrix}x_{k|k}\\

x_{k+1|k}

\end{matrix}\right) & \sim\mathcal{N}\left(\left(\begin{matrix}\hat{x}_{k|k}\\

F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}

\end{matrix}\right),\left(\begin{matrix}P_{k|k} & P_{k|k}F_{k+1}^{T}\\

F_{k+1}P_{k|k} & F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}

\end{matrix}\right)\right)\\

X_{k|k}|x_{k+1|k} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\

& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)\\

X_{k|k} & =\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right)+\epsilon\\

& \mbox{where }\epsilon\sim\mathcal{N}(0,P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k})\\

X_{k|T} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(\hat{x}_{k+1|T}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\

& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}P_{k|T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}+P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)

\end{align*}

Hence

- innovation in smoothing is $\delta=\left(\hat{x}_{k+1|T} -\hat{x}_{k+1|k}\right)$
- Covariance of innovation is $P_{k+1|k}=F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}$
- Gain in smoothing is $G=P_{k|k}F_{k+1}^{T}P_{k+1|k}^{-1}$
- Updated (a posterior) state estimate $\hat{x}_{k|T}=\hat{x}_{k|k}+G\delta$
- Updated (a posterior) covariance matrix $P_{k|T}=P_{k|k}+G(P_{k+1|T}-P_{k+1|k})G^T$

We will estimate $F$, $H$, $Q$ and $R$ from observations $z_1,\dots z_T$ about the following system

- $X_t = F X_{t-1} + w_t$, where $w_k\sim{\cal N}(0, Q)$ are i.i.d.
- $Z_t = H X_t + v_t$, where $v_k\sim{\cal N}(0, R)$ are i.i.d. and independent of $w_{t’}$ for all $t’$

Likelihood of $X_{1,\dots,T},Z_{1,\dots,T}$

\begin{align}

P(X_{1,\dots,T},Z_{1,\dots,T})

=&P(X_{1})\prod_{t=2}^{T}P(X_{t}|X_{t-1})\prod_{t=1}^{T}P(Z_{t}|X_{t})\\

=&\left(2\pi\right)^{-\frac{d_X}{2}}\left|\Sigma_{0}\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)\right)\\

&\prod_{t=2}^{T}\left(2\pi\right)^{-\frac{d_X}{2}}\left|Q\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)\right)\cdot\\

&\prod_{t=1}^{T}\left(2\pi\right)^{-\frac{d_Z}{2}}\left|R\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)\right)

\end{align}

Expected log likelihood observing $Z_{1,\dots,T}$

\begin{align}

\mathbb{E}_{X_{1,\dots,T}}\log P(X_{1,\dots,T},Z_{1,\dots,T})

=& {\rm constant} {-\frac{1}{2}}\log\left|\Sigma_{0}\right|-{1\over 2}\mathbb{E}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)+\\

&\sum_{t=2}^{T}{-\frac{1}{2}}\log\left|Q\right|-{1\over 2}\mathbb{E}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)+\\

&\sum_{t=1}^{T}{-\frac{1}{2}}\log\left|R\right|-{1\over 2}\mathbb{E}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)

\end{align}

Estimating state transition model $F$ and observation model $H$

\begin{align}

\frac{\partial}{\partial F_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&Q^{-\mbox{T}}\sum\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}=Q^{-\mbox{T}}F\sum\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}\\

\Rightarrow&F=\left(\sum_{t=2}^{T}\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}\right)\left(\sum_{t=2}^{T}\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}\right)^{-1}\\

\frac{\partial}{\partial H_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0\Rightarrow&H=\left(\sum_{t=1}^{T}Z_{t}\mathbb{E}X_{t}^{\mbox{T}}\right)\left(\sum_{t=1}^{T}\mathbb{E}X_{t}X_{t}^{\mbox{T}}\right)^{-1}

\end{align}

Solution: write down an equation involving

$\left(\matrix{\frac{\partial\log P}{\partial F_{11}} & \dots & \frac{\partial\log P}{\partial F_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial\log P}{\partial F_{d1}} & \dots & \frac{\partial\log P}{\partial F_{dd}} }\right)$

and solve $F$, where $d$ is the dimensionality of $X$. Apply the same procedure to solve $H$.

Estimating variance of process noise $Q$ and observation noise $R$

\begin{align}

\frac{\partial}{\partial Q_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&\left(T-1\right)Q-\sum_{t=2}^{T}\left(\mathbb{E}X_{t}X_{t}^{\mbox{T}}-F\mathbb{E}X_{t-1}X_{t}^{\mbox{T}}-\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}F^{\mbox{T}}+F\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}F^{\mbox{T}}\right)\stackrel{\mbox{set}}{=}0\\

\Rightarrow&Q=\frac{1}{T-1}\sum_{t=2}^{T}\left(\mathbb{E}X_{t}X_{t}^{\mbox{T}}-F\mathbb{E}X_{t-1}X_{t}^{\mbox{T}}-\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}F^{\mbox{T}}+F\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}F^{\mbox{T}}\right)\\

\frac{\partial}{\partial R_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&R=\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}Z_{t}Z_{t}^{\mbox{T}}-H\mathbb{E}X_{t}Z_{t}^{\mbox{T}}-Z_{t}\mathbb{E}X_{t}^{\mbox{T}}H^{\mbox{T}}+H\mathbb{E}X_{t}X_{t}^{\mbox{T}}H^{\mbox{T}}\right)

\end{align}

Solution: Note that the determinant of a matrix is $|A| = \sum_{\sigma\in S_n}{\rm sgn}(\sigma)\prod_{k=1}^d A_{k\sigma_k}$. Hence $\left(\matrix{ \frac{\partial|A|}{\partial A_{11}} & \dots & \frac{\partial|A|}{\partial A_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial|A|}{\partial A_{d1}} & \dots & \frac{\partial|A|}{\partial A_{dd}}}\right)\cdot A^{\mbox{T}}=|A|\cdot I$ and $\frac{\partial\log|A|}{\partial A}=\frac{|A|}{|A|}A^{-T}$.

Our goal is to find inverse of the covariance matrix $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)$. In order to do that, we first use the 2nd column $\left(\matrix{\Sigma_{12}\cr\Sigma_{22}}\right)$ to Gaussian-eliminate $\Sigma_{21}$, which is equivalent to right multiply the covariance matrix with $\left(\matrix{I & 0 \cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$. We then use the resulting 2nd row $\left(\matrix{0 & \Sigma_{22}}\right)$ to eliminate $\Sigma_{12}$, which is equivalent to left multiply the covariance matrix with $\left(\matrix{ I & \Sigma_{12}\Sigma_{22}^{-1} \cr 0 & I}\right)$. We performed the following operation.

$$

\begin{align*}

& \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)

\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)\\

= & \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr & I}\right)

\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12}\cr 0 & \Sigma_{22}}\right)\\

= & \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right).

\end{align*}

$$

If we move $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$ and $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)$ to the right hand side of the equation above, then $\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right)$ and the resulting $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)^{-1}$ to the left hand side of the equation, we get the following.

$$

\begin{align*}

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)^{-1}

= & \left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)

\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)

\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

\end{align*}

$$

**Application of Schur complement to find the conditional probability density function of a multivariate normal distribution**. Let $x = \left(\matrix{X_1\cr X_2}\right)\sim \mathcal{N}\left(\left(\matrix{\mu_1\cr\mu_2}\right), \left(\matrix{\Sigma_{11} &\Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right) \right)$ be multivariate normal distribution, where $X_1$ and $X_2$ are vectors of dimension $p$ and $q$ respectively, and we want to find the conditional distribution of $X_1$ on $X_2=x_2$.

The probability density function of $X$ is

\begin{align*}

& f\left(\matrix{x_{1}\cr x_{2}}\right)\\

\propto & \exp\left(-\frac{1}{2}

\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1}

\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\

= & \exp\left(-\frac{1}{2}

\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)

\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)

\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)

\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\

=& \exp\left(\left(\begin{matrix}\left(x_{1}-\mu_{1}\right)^{T}-\left(x_{2}-\mu_{2}\right)^{T}\Sigma_{22}^{-1}\Sigma_{21} & \left(x_{2}-\mu_{2}\right)^{T}\end{matrix}\right)

\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)

\left(\matrix{x_{1}-\mu_{1}-\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right)\cr x_{2}-\mu_{2}}\right)\right).

\end{align*}

Hence $X_{1}|X_{2}=x_{2}\sim\mathcal{N}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$.

1. Convexity of $f^∗(y)=\sup_x(y^\mbox{T}⋅x−f(x))$: We want to show that $f^*(\alpha_1*y_1+\alpha_2*y_2)\le \alpha_1*f^*(y_1)+\alpha_2*f^*(y_2)$ for any $0\le\alpha_1,\alpha_2\le 1$ with $\alpha_1+\alpha_2=1$, $y_1$ and $y_2$. In fact we have

$$\begin{eqnarray*}

f^{*}(\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2}) & = & \sup_{x}((\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2})^{\mbox{T}}x-f(x))\\

& = & \sup_{x}(\alpha_{1}(y_{1}^{\mbox{T}}x-f(x))+\alpha_{2}(y_{2}^{\mbox{T}}x-f(x)))\\

& \le & \alpha_{1}\sup_{x_{1}}((y_{1}^{\mbox{T}}x-f(x)))+\alpha_{2}\sup_{x_{2}}((y_{2}^{\mbox{T}}x-f(x)))\\

& = & \alpha_{1}f^{*}(y_{1})+\alpha_{2}f^{*}(y_{2})\\

\end{eqnarray*}$$

2. If $f(x)$ is differentiable in $x$, the supremum $\sup_x y^\mbox{T}\cdot x – f(x)$ is attained at $x^*$ with ${\partial\over\partial x}=0$, i.e., $y-f'(x^*)=0$ or $y$ is the derivative of $f$ at point $x^*$. By definition of supremum, $y^\mbox{T}\cdot x-f(x)\le f^*(y)$. We call $(x^*, f'(x^*))$ Legendre conjugate pair.

3. Double Legendre-Fenchel dual $f^{**}(x)$ as the Legendre-Fenchel dual of $f^*(y)$ is convex as the result of 1. The double Legendre-Fenchel dual of a convex function is itself. Suppose $g(x)\ge f^{**}(x)$ is a convex function below $f$. Then $\max(f^{**},g)$ is a convex function below $f$. Hence double Legendre-Fenchel dual of a function is the maximal convex function below.

4. Let $x=(x_1,x_2,\cdots)$ be the mean parameters of categorical distribution. Then $\phi(x)=x^\mbox{T}\cdot\log x=(\log x)^\mbox{T} \cdot x$ is the entropy of this distribution, $\triangledown \phi(x) = 1+\log(x)$, and

$$\begin{eqnarray*}

d_{\phi}(y,x) & = & \phi(y)-\phi(x)-\triangledown\phi(x)\cdot(y-x)\\

& = & y^{\mbox{T}}\cdot\log y-x^{\mbox{T}}\cdot\log x-(1+\log x)^{\mbox{T}}(y-x)\\

& = & y^{\mbox{T}}\cdot\log y-y^{\mbox{T}}\cdot\log x\\

& = & d_{\mbox{KL}}(x,y)

\end{eqnarray*}$$

5. Since $\phi$ is convex, $\phi(\alpha\cdot x+(1-\alpha)\cdot y)\le\alpha\phi(x)+(1-\alpha)\phi(y)$ for $0\le \alpha\le 1$, or

$$\begin{eqnarray*}

\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{(1-\alpha)(\phi(y)-\phi(x))}{(1-\alpha)(y-x)}=\frac{\phi(y)-\phi(x)}{(y-x)},\\

\phi'(x)=\lim_{\alpha\to1}\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{\phi(y)-\phi(x)}{(y-x)},\\

d_{\phi}(y,x)=\phi(y)-\phi(x)-\phi'(x)\cdot(y-x) & \ge & 0.

\end{eqnarray*}$$

6. For $0\le \alpha_1,\alpha_2\le 1$ and $\alpha_1+\alpha_2=1$,

$$\begin{eqnarray*}

d_{\phi}(\alpha_{1}y_{1}+\alpha_{2}y_{2},x) & = & \phi(\alpha_{1}y_{1}+\alpha_{2}y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\

& \le & \alpha_{1}\phi(y_{1})+\alpha_{2}\phi(y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\

& = & \alpha_{1}\left(\phi(y_{1})-\phi(x)-\phi'(x)(y_{1}-x)\right)+\alpha_{2}\left(\left(\phi(y_{2})-\phi(x)-\phi'(x)(y_{2}-x)\right)\right)\\

& = & \alpha_{1}d_{\phi}(x,y_{1})+\alpha_{2}d_{\phi}(x,y_{2}).

\end{eqnarray*}$$

7.

$$\begin{eqnarray*}

d_{\phi_{1}+\phi_{2}}(x,y) & = & (\phi_{1}+\phi_{2})(y)-(\phi_{1}+\phi_{2})(x)-(\phi_{1}+\phi_{2})(x)\cdot(y-x)\\

& = & \cdots=d_{\phi_{1}}(x,y)+d_{\phi_{2}}(x,y)

\end{eqnarray*}$$

8. $\{x|d_\phi(x,u)=d_\phi(x,v)\}$ is a hyper plane: $d_{\phi}(x,u)=\phi(x)-\phi(u)+\phi'(u)(x-u)=\phi(x)-\phi(v)+\phi'(v)(x-v)=d_{\phi}(x,v)$. Rearranging the terms $\phi(u)-\phi(v)+\phi'(u)\cdot u-\phi'(v)\cdot v-(\phi'(u)-\phi'(v))\cdot x=0$.

9. We expand the expectationon $\mbox{E}_{X}d_{\phi}(X,u)=\mbox{E}\left(\phi(X)-\phi(u)-\phi'(u)(X-u)\right)=\mbox{E}\phi-\phi(u)-\phi'(u)(\mbox{E}X-u)$, and set its partial derivative over $u$ to be 0: $\frac{\partial}{\partial u}\mbox{E}_{X}d_{\phi}(X,u)=-\phi'(u)-\phi”(u)(\mbox{E}X-u)+\phi'(u)=0$. Since $\phi$ is convex, $\phi”\ge0$ and $u=\mbox{E}X$ minimizes expected K-L divergence.

10. $\psi(\theta)$ is convex-conjugate of $\phi(x)$, and $(\theta_1, x_1)$ and $(\theta_2,x_2)$ are conjugate pairs.

$$\begin{eqnarray*}

d_{\phi}(x_{2},x_{1}) & = & \phi(x_{2})-\phi(x_{1})-\phi'(x_{1})(x_{2}-x_{1})\\

& = & \theta_{2}^{\mbox{T}}\cdot x_{2}-\psi(\theta_{2})-\left(\theta_{1}^{\mbox{T}}\cdot x_{1}-\psi(\theta_{1})\right)-\theta_{1}(x_{2}-x_{1})\\

& = & \psi(\theta_{1})-\psi(\theta_{2})-x_{2}^{\mbox{T}}\left(\theta_{2}-\theta_{1}\right)\\

& = & d_{\psi}\left(\theta_{1},\theta_{2}\right).

\end{eqnarray*}$$

11

$$\begin{eqnarray*}

H(q)-\sum_{i}\alpha_{i}\left(E_{q}(f_{i})-E_{\tilde{p}}(f_{i})\right) & \leftarrow & \sum_{k}q_{k}\log q_{k}-\sum_{i}\alpha_{i}\left(\sum_{k}f_{i}(x_{k})q_{k}-E_{\tilde{p}}(f_{i})\right)\\

\frac{\partial}{\partial q_{k}} & = & 1+\log q_{k}-\sum_{i}\alpha_{i}f_{i}(x_{k})\stackrel{\mbox{set}}{=}0\\

\Rightarrow q_{k} & \propto & \exp\left(\sum_{i}\alpha_{i}f_{i}(x_{k})\right)\\

& & \mbox{take to the limit.}

\end{eqnarray*}$$

12. Given $q(x)=\exp(\theta\cdot g(x)-\Psi(\theta))$, $\log q(x)=\theta\cdot g(x)-\Psi(\theta)$. Setting partial derivative of log likelihood over parameter to 0: $\frac{\partial}{\partial\theta}\sum_{k}(\theta\cdot g(x^{(k)})-\Psi(\theta))=\sum_{k}g(x^{(k)})-N\cdot\frac{\partial}{\partial\theta}\Psi(\theta)\stackrel{\mbox{set}}{=}0$, we get $\mu=\frac{1}{N}\sum_{k=1}^{N}g(x^{(k)})$.

13. To find $\Psi^{*}(\mu)=\sup_{\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)$, we take the derivative of $\mu\cdot\theta-\Psi(\theta)$ over $\theta$: $$\frac{\partial}{\partial\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)=\mu-\underbrace{\frac{\sum_{x}f(x)\cdot\exp(\theta\cdot f(x))}{\sum_{x}\exp\left(\theta\cdot f(x)\right)}}_{\triangledown_{\theta}\Psi=\mbox{E}(f(x))=\mu}\stackrel{\mbox{set}}{=}0.$$

Hence $\mu=\triangledown_{\theta}\Psi$, and $$\Psi^{*}(\mu)=\theta\cdot\triangledown_{\theta}\Psi-\Psi(\theta)=\mbox{E}(\frac{\theta\cdot f(x)}{\Psi(\theta)})=\sum_{x}p(x;\theta)\log p(x;\theta)=-H(\theta).$$

1. There are 2 events in this system: weather change from sunny to rainy, and from rainy to sunny. Let the rate of the first event be $\alpha$ per day, and the rate of the latter event be $\beta$ per day. The number of events in $\Delta t$ days will be $\alpha\cdot\Delta t$ and $\beta\cdot\Delta t$ for $\Delta t$ small enough. This system is an inhomogeneous Poisson process.

Using first order approximation, the state transition kernel of one-day and one-hour is the following where $\Delta t$ takes value 1 and $1\over 24$ respectively,

$$\begin{matrix}\mbox{sunny}\\

\mbox{raining}

\end{matrix}\left(\begin{matrix}1-\alpha & \alpha\\

\beta & 1-\beta

\end{matrix}\right)\cdot\Delta t.$$

Refer lecture slides of 09/01 for sampling in R.

2. Refer lecture slides of 09/01 for EM in R.

3. If we do not have observations, we take $P(Y_t|X_t)=1$. In other words, we do not update latent state distribution.

4. Refer stochastic kinetic model algorithm of 10/01/2015.

]]>Brownian motion is the main process for the calculus of continuous processes, which was described by Botanist R. Brown as the motion of a pollen particle suspended in fluid in 1828. It was observed that a particle moved in an irregular, random fashion. A. Einstein, in 1905, argued that the movement is due to bombardment of the particle by the molecules of the fluid, he obtained the equations for Brownian motion. In 1900, L. Bachelier used the Brownian motion as a model for movement of stock prices in his mathematical theory of speculation. The mathematical foundation for Brownian motion as a stochastic process was done by N. Wiener in 1931, and this process is also called the Wiener process.

The Brownian Motion process B(t) serves as a basic model for the cumulative effect of pure noise. If B(t) denotes the position of a particle at time t, then the displacement B(t) − B(0) is the effect of the purely random bombardment by the molecules of the fluid, or the effect of noise over time t.

Brownian motion {B(t)} is a stochastic process with the following properties.

- (Independence of increments) B(t) − B(s) for t > s is independent of the past, that is, independent of $ B(u) , 0 \leq u \leq s$, or of $ {F (s)}$ , the ${\sigma-algebra}$ generated by ${B(u), u \leq s}$.
- (Normal increments) ${B(t) − B(s)}$ has Normal distribution with mean $ {0}$ and variance ${t − s}$. This implies taking $ {s = 0}$ that $ {B(t) − B(0)}$ has $ {N (0, t)}$ distribution.
- (Continuity of paths) $ {B(t), t ≥ 0}$ are continuous functions of $ {t}$.

Sometimes, another version of Brownian Motion, ${W(t)=B(t+s)-B(s)}$, is used, as so called “Wiener Process” or “Standard Brownian Motion”.

If the process is started at $ {x}$, $ {B(0) = x}$, then $ {B(t)}$ has the $ {N (x, t)}$ distribution. More generally, the conditional distribution of $ { B(t + s)}$ given that $ {B(s) = x}$ is $ {N (x, t)}$. The transition function $ {P (y, t, x, s)}$ is the cumulative distribution function of this distribution,

$ \displaystyle P (y, t, x, s) = P(B(t + s) ≤ y|B(s) = x) = P_x (B(t) ≤ y)\ \ \ \ \ (1)$

The density function of this distribution is the transition probability density function of Brownian motion,

$ \displaystyle p_t (x,y) = (1/\sqrt{2\pi t}) e^{-{(y-x)^2/2t}}\ \ \ \ \ (2)$

The finite-dimensional distributions can be computed with the help of the transition probability density function, by using independence of increments.

$ \displaystyle \begin{split}P_x (B(t_1 ) ≤ x_1 , B(t_2 ) ≤ x_2 , . . . , B(t_n ) ≤ x_n ) = \\ \int_{-\infty}^{x_1}p_{t_1}(x,y_1)d{y_1} \int_{-\infty}^{x_2}p_{{t_2}-{t_1}}(y_1,y_2)d{y_2}…\int_{-\infty}^{x_n}p_{{t_n}-{t_{n-1}}}(y_{n-1},y_n)d{y_n}\end{split}\ \ \ \ \ (3)$

It is easy to see that the one-dimensional distributions of Brownian motion satisfy $ {P_0 (B(t) ∈ A) = P_x (B(t) ∈ x + A)}$, where A is an interval on the line. All finite-dimensional distributions of $ {B_x (t)}$ and $ {x + B_0 (t)}$ are the same. Thus $ {B_x (t) − x}$ is Brownian motion started at $ {0}$, and $ {B_0 (t) + x}$ is Brownian motion started at x, in other words

$ \displaystyle B^x (t) = x + B^0(t)\ \ \ \ \ (4)$

This property is called the space homogeneous property of Brownian Motion. Generally, a stochastic process is called space-homogeneous if its finite- dimensional distributions do not change with a shift in space, namely if

$ {P(X(t_1 ) ≤ x_1 , X(t_2 ) ≤ x_2 , . . . X(t_n ) ≤ x_n |X(0) = 0) \\= P(X(t_1 ) ≤ x_1 + x, X(t_2 ) ≤ x_2 + x, . . . X(t_n ) ≤ x_n + x|X(0) = x)}$.

On the other hand, Brownian motion also bears the time homogeneous property as below.

$ {P(B(t) ≤ y |B(s) = x) = P(B(t-s) ≤ y|B(0) = x)}$

Actually, all finite-dimensional distributions of Brownian motion are time homogeneous.

Four realizations of Brownian motion $ {B = B(t)}$ started at 0 are exhibited in Figure 1. Although it is a process governed by the pure chance with zero mean, it has regions where motion looks like it has “trends”.

Figure 1: Four realizations or paths of Brownian motion B(t).

The covariance function of the process X(t) is defined by

$ \displaystyle \begin{split}\gamma(s, t) = Cov (X(t), X(s)) \\= E (X(t) − EX(t)) (X(s) − EX(s)) \\= E (X(t)X(s)) − EX(t)EX(s)\end{split}\ \ \ \ \ (5)$

Theorem: A Brownian motion is a Gaussian process with zero mean function, and covariance function $ {min(t, s)}$. Conversely, a Gaussian process with zero mean function, and covariance function $ {min(t, s)}$ is a Brownian motion.

Proof: Since the mean of the Brownian motion is zero,

$ {\gamma(s, t) = Cov (B(t), B(s)) = E (B(t)B(s))}$.

If ${t < s}$ then ${B(s)= B(t) + B(s) − B(t)}$, and

$E(B(t), B(s)) = EB^2(t) + E(B(t)(B(s) − B(t)))= EB^2(t) = t$,

where we used independence of increments property. Similarly if $t > s$, $E(B(t)B(s)) = min(t, s)$.

To show the converse, let $t$ be arbitrary and $s ≥ 0$. $X(t)$ is a Gaussian process, thus the joint distribution of $X(t), X(t + s)$ is a bivariate Normal, and by conditions has zero mean. Therefore the vector $\displaystyle (X(t), X(t+s)−X(t)$ is also bivariate Normal. The variables $X(t)$ and $X(t + s) − X(t)$ are uncorrelated, using that $Cov(X(t), X(t + s)) = min(t, s)$, $Cov(X(t), X(t+s)−X(t)) = Cov(X(t), X(t+s))−Cov(X(t), X(t)) = t−t = 0$.

A property of the multivariate Normal distribution implies that these variables are independent. Thus the increment $X(t + s) − X(t)$ is independent of $X(t)$ and has $N (0, s)$ distribution. Therefore it is a Brownian motion.

To determine the distribution of:

$ \displaystyle \int_{0}^{1} B(t) dt$

We can get this by taking limit of distribution of approximating sums

$ \displaystyle \sum B(t_{i})\Delta_{i}$, where $ {\Delta_{i} = t_{i+1} – t_{i}}$

If, for example, $ {t_{i} = i/n}$, then for $ {n=4}$ the approximating sum is $\displaystyle {1/4 (B(1/4) + B(1/2) + B(3/4) + B(1))}$. The distribution of above expression was found to be normal variable with mean 0 and variance 15/32. Similarly, distribution of all of the approximating sums is Normal with zero mean. Also, it can be shown that the limit of Gaussian distributions is a Gaussian distribution. Thus, the mean of required distribution is zero. Its variance is calculated as follows:

$ \displaystyle Var(\int_{0}^{1}B(t)dt) = Cov(\int_{0}^{1} B(t)dt, \int_{0}^{1}B(s)ds)$

$ \displaystyle =E(\int_{0}^{1} B(t)dt, \int_{0}^{1}B(s)ds)$

$ \displaystyle =\int_{0}^{1} \int_{0}^{1}E(B(t) B(s))dt ds$

$ \displaystyle =\int_{0}^{1} \int_{0}^{1}Cov(B(t) B(s))dt ds$

$ \displaystyle =\int_{0}^{1} \int_{0}^{1}min(t,s)dt ds$

$ \displaystyle =1/3$

The quadratic variation of Brownian motion Q(t) is defined as:

$ \displaystyle Q(t)=Q([0,t])=lim\sum_{i=1}^{n}|B(t_i^{n})-B(t_{i-1}^{n})|^{2}$

where the limit is taken over all shrinking partitions of [0, t], with

$ \displaystyle \delta_n = max_i (t_{i+1}^{n}-t_i^n )$

becomes 0 as n approaches $ {\infty}$ .

Let the sequence of partitions be such that interval is divide into two and the sub-interval also divided to two and so on.

Let $ {T_n = \sum_{i=1}^{n}|B(t_i^{n})-B(t_{i-1}^{n})|^{2}}$

$ \displaystyle E(T_n) = E(\sum_{i=1}^{n}|B(t_i^{n})-B(t_{i-1}^{n})|^{2})$

$ \displaystyle =\sum_{i=1}^n (t_i^{n} – t_{i-1}^{n})$

$ \displaystyle = t-0 = t$

Similarly,

$ \displaystyle Var(T_n) = Var(\sum_{i}|B(t_i^{n})-B(t_{i-1}^{n})|^{2})$

$ \displaystyle = \sum_{i}Var(B(t_i^{n})-B(t_{i-1}^{n}))^{2}$

$ \displaystyle = \sum_{i}3(t_i^{n} – t_{i-1}^n)^{2}$

(since fourth moment of N(0,$ {\sigma^2}$) is $ {3\sigma^4}$)

$ \displaystyle \leq \sum_{i}3max(t_i^{n} – t_{i-1}^n)t$

$ \displaystyle =3t\delta_n$

Hence, $ {\sum_{i=1}^n Var(T_n) \leq \infty}$. Using monotone convergence theorem, we find $ {E(\sum_{i=1}^n (T_n-E(T_n))^{2} \leq \infty}$. This implies that the series inside the expectation converges almost surely. Hence its terms converge to zero, i.e. $ {T_n-E(T_n)}$ becomes 0 almost surely, consequently ${T_n}$ tends to t almost surely.

B(t)’s as functions of t have the following properties. Almost every sample path B(t), $ {0 \leq t \leq T}$

1. is a continuous function of t;

2. is not monotone in any interval, no matter how small the interval is;

3. is not differentiable at any point;

4. has infinite variation on any interval, no matter how small it is;

5. has quadratic variation on [0, t] equal to t, for any t.

Properties 1 and 3 of Brownian motion paths state that although any B(t) is a continuous function of t, it has increments $ {\Delta B(t)}$, over a time interval $ {\Delta t}$, much larger than $ {\Delta t}$, as $ {\Delta t}$ tends to 0.

For property 4, a continuous function with finite variation has 0 quadratic variation.

For property 2, a monotone function has finite variation.

A stochastic process {X(t), $ {t \geq 0}$} is a martingale if

1) for any t it is integrable, $ {E|X(t)| < \infty}$,

2) for any $ {s > 0}$,

$ \displaystyle E(X(t + s)|F(t)) = X(t)$

where the F(t) is the information up-till time t and equality holds almost surely. In other words, if we have the values up-till time t and $ {X(t)=x}$, then expected value of future values is x.

Let B(t) be any Brownian motion. Then

1. B(t) is a martingale.

2. $ {B(t)^2-t}$ is a martingale.

3. For any u, $ {e^{uB(t)-(u^2/2)t}}$ is a martingale.

Proof of the above properties:

Note: The key idea to establish the martingale property of Brownian motion is using the independent increment property, which implies

$\displaystyle E(g(B_{t+s}-B_t)|F_t)=E(g(B_{t+s}-B_t))$. Here take the proof of second martingale above as an example.

$ \displaystyle B(t+s)^2=(B(t)+B(t + s)-B(t))^2$

$ \displaystyle =B(t)^2+2B(t)(B(t+s)- B(t))+(B(t+s)-B(t))^2$

Now, $ {E(B(t+s)^2|F(t)))}$

$ \displaystyle =B(t)^2+2E(B(t)(B(t + s)- B(t))|F(t))+E(B(t+s)-B(t))^2|F(t))$

$ \displaystyle =B(t)^2+s$

Subtracting (t+s), we get $ {B(t)^2-t}$ is a martingale.

A random time $T$ is called a stopping time for $B(t)$, where $t\ge 0$, if for any $t$ it is possible to decide whether $T$ has occurred or not by observing $B(s)$ for $0\le s\le t$. More rigorously, for any $t$ the sets $\{T\le t\}\in F_t$, the $\sigma$-algebra generated by $B(s)$ for $0\le s\le t$.

– Any non-random time $T$ is a stopping time. $\{T\le t\}$ is either $\emptyset$ or $\Omega$.

– The first time B(t) takes value 1, $\min_t B(t)=1$, is stopping time.

$ \displaystyle {\{ T\le t\}= \{B(u)<1}$ , for all $ u\le t\}\in F_t $.

– The time Brownian motion reaches maximum in $[0,1]$ is not stopping time.

– The time Brownian motion last crosses 0 is not stopping time.

Brownian motion has Markov property:

$\displaystyle P(B(t+s)\le y|F_t)=P(B(t+s)\le y|B(t))$

Proof: use exponential Martingale:

$ E(\exp(u\cdot B(t+s))|F_t) = \exp(u\cdot B(t))\cdot \mathbf E(\exp(u(B(t+s)-B(t)))|F_t)

= \exp(u\cdot B(t))\cdot \mathbf E(\exp(u(B(t+s)-B(t))))$

Brownian motion has strong Markov property: for any finite stopping time $T$, $\displaystyle P(B_{T+t}|F_T)=P(B_{T+t}|B_T)$.

The strong Markov property is similar to the Markov property, except that in the definition a fixed time t is replaced by a stopping time T. The proof of the strong Markov property can be done using the exponential martingale and the Optimal Stopping Theorem.

Hitting time $\tau_x$ denotes the first time $B(t)$ hits level $x$: $\tau_x=\inf\{t>0:B(t)=x\}$

Exit time $\tau$ denotes the first time $B(t)$ exit an interval $(a,b)$: $\displaystyle \tau=\min(\tau_a,\tau_b)$

Claim: Consider a Brownian motion process $B_x(t)$ starting from $a<x<b$ and an exit time $\tau=\min(\tau_a,\tau_b)$. We have $P_x(\tau<\infty)=1$ and $E_x\tau<\infty$

Corollary: $P_a(\tau_b<\infty)=1$ and $P_a(\tau_a<\infty)=1$.

Let the maximum and minimun of Brownian motion $B_0(s)$ starting from 0 on $[0,t]$ be $M(t)=\max_{0\le s\le t} B(s)$ and $m(t)=\min_{0\le s\le t} B(s)$.

Let $\tau_x=\inf\{t>0: B(t)=x\}$. Then for any $x>0$, we have $\displaystyle P_0(M(t)\ge x)=2\cdot P_0(B(t)\ge x)=2(1-\Phi({x\over\sqrt t}))$.

Proof: Note the equivalence of two events: $\{M(t)\ge x\}=\{\tau_x\le t\}$, and that $P(B(t)\ge x)=P(B(\tau_x+(t-\tau_x))-B(\tau_x)\ge 0, \tau_x\le t)=P(\hat B(t-\tau_x)>0)\cdot P(M(t)\ge x)$ by strong Markov property.

Corollary: $P(B(t)\le 0, \forall 0\le t\le 1)=0$

The probability density of $\tau_x$ is inverse gamma density $\displaystyle f_{\tau_x}={|x|\over\sqrt{2\pi}}t^{-{1\over2}}\exp(-{x^2\over 2t})$, and $E \tau_x=\infty$. Hence although $x$ is visited with probability 1, the expected time for $x$ to be visited is infinite.

Proof: Note the equivalence of two events: $\{M(t)\ge x\}=\{\tau_x\le t\}$. Hence $P(\tau_x\le t)=P(M(t)\ge x)=2P(B(t)\ge x)=\dots$. Differentiate over $t$.

Let $T$ be a stopping time. Define $\hat B(t)=B(t)$ for $t\le T$ and $\hat B(t)=2B(T)-B(t)$ for $t\ge T$. Then $\hat B$ is also a Brownian motion. Below, Figure 2 shows this principle.

Figure 2: Reflection Principle of Brownian Motion

Fima C. Klebaner (2005). Introduction to Stochastic Calculus with Applications. Imperial College Press.

Peter Tankov and Rama Cont (2003). Financial Modelling with Jump Processes. Chapman & Hall/CRC Financial Mathematics Series. On Course web site

]]>Let us take an example. Let’s consider a time period [0, T ] partitioned into n intervals of equal length $\Delta t = T/n$ with endpoints $ t_{k}= k\Delta t,$ k = 0, …, n. These are the times at which the market is open and trade in shares can take place. An investor buys $q(t_{0})$ shares at time 0 at a price of $S(t_{0})$ each. At time $t_{1}$, market trading establishes a new share price $S(t_{1})$. Once this price has been revealed the investor can change the quantity of shares held, from $q(t_{0})$ to $q(t_{1})$. The same at the subsequent times $t_{2}$ through $t_{n−1}$. At time $t_{n}$ the entire portfolio is liquidated as soon as share price $S(t_{n})$ becomes known. A portfolio of $q(t_{k})$ shares is held from just after the share price $S(t_{k})$ has been revealed, to when the share price $S(t_{k+1})$ at time $t_{k+1}$ becomes known. This portfolio is worth $q(t_{k} )S(t_{k})$ at time $t_{k}$ and $q(t_{k})S(t_{k+1})$ at time $t_{k+1}$. So from time $t_{k}$ to time $t_{k+1}$, with the change in share price $S(t_{k+1})$ − $S(t_{k})$ denoted by $S_{k}$, the portfolio value changes by $q(t_{k}) S_{k}$. The gain over all trades can then be expressed as $I_n = \sum_{k=0}^{n-1} q(t_{k}) \Delta S_{k}$. This expression is an example of a discrete stochastic integral.

At each trading time, the quantity of shares to be held until the next trading opportunity arises, is decided by the investor. Some investors will buy shares and others will sell as a trade requires two parties. The factors which an investor takes into account in determining the shareholdings are unknown, so q is a discrete-time random process. Thus the stochastic integral is constructed from the two random processes q and S:

$$\sum_{k=0}^{n-1} q(t_k )[ \mu S(t_k) \Delta t + \sigma_S(t_k ) \Delta B(t_k )]

= I_n$$

Increasing n makes the time step t smaller and discrete-time trading approaches continuous-time trading. The first term is a summation with respect to time step $\Delta t$. The second term is a summation with respect to Brownian motion increment [B($t_{k+1}$) − B($t_{k}$ )].

Let’s take time period [0, T] partitioned into n intervals of equal length t = T/n, with endpoints $t_{k}$ = $k\Delta t$, k = 0, . . ., n. The simplest integrand f is a step-function whose values are non-random, so the only random quantity is the Brownian motion integrator. Step-function f can be written with indicator notation as

$$f = f(t_{0}) 1_{[t_0,t_1)} + · · · + f (t_{k} ) 1_{[t_k ,t_{k+1})} + · · · + f (t_{n−1}) 1_{[t_{n−1},t_n )}$$

where $1_{[tk ,tk+1)}$ denotes the indicator function which has value 1 on the interval shown in the subscript, and 0 elsewhere. Define the discrete stochastic integral $I_{n}( f )$:

$$I_n( f ) = \sum_{k=0}^{n-1}f(t_{k} )[B(t_{k+1}) − B(t_{k} )]$$

Therefore, integral of a non-random process over Brownian motion is the limit of the integrals of non-random simple processes approaching this non-random process.

Let f be at random level $f(t_{k}, ω)$ for $t_{k}$ < t < $t_{k+1}.$ When the history of the Brownian motion process becomes known progressively at each time $t_{k}$, it must be possible to determine $f (t_{k})$ from this history alone. The current terminology is that $f (t_{k})$ must be adapted to the filtration ($t_{k}$ ).

**Properties:**

**Linearity:**$\int_0^T (\alpha X(t) +\beta Y(t)) d B(t) = \alpha \int_0^T X(t) d B(t) + \beta \int_0^T Y(t) d B(t)$ for constant $\alpha$ and $\beta$ and simple adapted processes $X(t)$ and $Y(t)$

$\int_0^T 1_{(a,b]} d B(t) = B(b)-B(a)$, and

$\int_0^T 1_{(a,b]} X(t) d B(t) = \int_a^b X(t) d B(t)$-
**Zero mean property**

$\mathbf E\int_0^T X(t) d B(t) = 0$ **Isometry property:**

$\mathbf E\left(\int_0^T X(t) d B(t)\right)^2=\int_0^T \mathbf E(X^2(t)) dt$

After looking at the specific case of simple adaptive process, we want to find the Ito integral for general integrands i.e. any regular adapted process. So let’s assume $X^n(t)$ to be a sequence of simple processes convergent in probability to the process $X(t)$. $$ X^{(n)}(t)\rightarrow X(t)$$ Then, under some conditions, the sequence of their integrals $\int_0^TX^n(t)dB(t)$ also convergence in probability. Hence, if, $$\int_0^TX^n(t)dB(t)\rightarrow J^n$$

Then $$\int_0^TX(t)dB(t)\rightarrow J$$

**For Example:**

Find $\int_0^T B(t)dB(t)$:

let’s assume $0=t_0^n<t_1^n<t_2^n< … <t_n^n = T$ be a partition of [0, T] and let- $$X^n(t) = \sum_{i=0}^{n-1} B(t_i^n)I_{(t_i^n, t_{i+1}^n)}(t)$$

Also $\delta_n = max_i(t_{i+1}^n – t_i^n)\rightarrow 0$ then $X_n(t)\rightarrow B(t)$ is approximately the continuity of the Brownian paths.

Hence, $$\int_0^TX^n(t)dB(t) = \sum_{i=0}^{n-1} B(t_i^n)(B(t_{i+1}^n)- B(t_{i+1}^n))$$

where, $$B(t_i^n)(B(t_{i+1}^n)- B(t_{i+1}^n))=1/2(B^2(t_{i+1}^n) – B^2(t_{i}^n) – (B(t_{i+1}^n) – B(t_{i}^n))^2)$$

Therefore, $$\int_0^TX^n(t)dB(t)$$

$$ = \sum_{i=0}^{n-1} 1/2(B^2(t_{i+1}^n) – B^2(t_{i}^n) – (B(t_{i+1}^n) – B(t_{i}^n))^2)$$

$$=1/2(B^2(T) – 1/2(B^2(0)) -1/2 \sum_{t=0}^{n-1}(B(t_{i+1}^n) – B(t_i^n))^2$$

Here, the last summation part converges to the quadratic variation T of Brownian motion on [0, T] in probability.

Therefore $\int_0^T B(t)dB(t) = lim \int_0^T X^n(t)dB(t) = 1/2B^2(T) – 1/2T$

**Properties:**

Any regular adaptive process has the same properties as we already discussed for the simple adaptive process.

**Linearity:**

$\int_0^T (\alpha X(t) +\beta Y(t)) d B(t) = \alpha \int_0^T X(t) d B(t) + \beta \int_0^T Y(t) d B(t)$

for constant $\alpha$ and $\beta$- $\int_0^T 1_{(a,b]} X(t) d B(t) = \int_a^b X(t) d B(t)$
**Zero mean property:**$\mathbf E\int_0^T X(t) d B(t) = 0$**Isometry property**: $\mathbf E\left(\int_0^T X(t) d B(t)\right)^2=\int_0^T \mathbf E(X^2(t)) dt$

$$Y(t)=\int_0^TX(s)dB(s)$$

(Equation A)

where,

$$\int_0^TX^2(s)ds<\infty$$

(Equation B)

In simple terms, ito intergal is a way of integrating a stochastic process variable with respect to another stochastic process that result into a new stochastic process. Here, X(s) is a square-integrable process (hence the equation B) that is adapted to the filteration generated by B(s). Standard calculous doesn’t apply to such functions since the function X(S) is not differentiable at any point and has infinite variateion over every time interval.

Ito integral plays crucial part in mathemetical finance to evaluate strategies to make descisions over time variable stochastic processes or Brownian processes such as stoch prices.

**Properties:**

**Martingale Property:**Probabilities of current state doesn’t depend on any knowledge of past events, given that it’s square is inegral is finite. I.e. Expected value of the state Y at time t doesn’t depend on the given $F_s$. And hence it is a stochastic variable Y(s) only.

$$E(Y(t)|F_s) = Y(s)$$**Quadratic Variation:**The quadratic variation of an Ito processes can be given by $\int_0^t X^2 (s)ds$

We know that B(T) is a Brownian motion from 0 to T and since we know the first order derivative and second order derivate of F then we can find out the value of Brownian motion at time T as in the example below.

Example: \[f(B(t)) = f(0)+\int_0^t f'(B(s)) d B(s) + {1\over 2}\int_0^t f”(B(s)) d s\]

where: \[\exp(B(t)) = 1+\int_0^t \exp(B(s)) d B(s) + {1\over 2}\int_0^t \exp(B(s)) d s\]

Proof of the above example is:

\[f(B(t)) = f(0) + \sum_{i=0}^{n-1} \left(f(B(t^n_{i+1}))-f(B(t^n_{i}))\right)\]

After taking the first order derivative,

\[\hphantom{f(B(t)) } = f(0) + \sum_{i=0}^{n-1} f'(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right) + {1\over 2}\sum_{i=0}^{n-1} f”(\theta^n_{i})\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\] For \[\theta^n_{i}\in \left(B(t^n_{i}),B(t^n_{i+1})\right)\]

\[\sum_{i=0}^{n-1} f'(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right) \to \int_0^t f'(B(s)) d B(s)\]

After taking the second order derivative and as $n\to\infty$ we have the below equations

\[\sum_{i=0}^{n-1} f”(\theta^n_{i})\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\to \sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\]

In probability as $n\to\infty$

then

\[\sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right)^2 \to \sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(t^n_{i+1}-t^n_{i}\right)\]

Hence

\[\sum_{i=0}^{n-1} f”(B(t^n_{i}))\left(B(t^n_{i+1})-B(t^n_{i})\right)^2\to \int_0^t f”(B(s)) d s\]

We know B(t) is a Brownian motion from $0<t<T$. An Ito-process Y(t) from 0≤t≤T is given by

\[Y(t) = Y(0) + \int_0^t \mu(s) d s + \int_0^t \sigma(s) d B(s)\] or

\[d Y(t) = \mu(t) d t + \sigma(t) d B(t)\]

where $0\le t\le T$, $Y(0)$ is $F_0$-measurable, $\mu(t)$ and $\sigma(t)$ are $F_t$-adapted, $\int_0^T|\mu(t)| d t<\infty$ and $\int_0^T\sigma^2(t) d t <\infty$.

Example: Taylor’s formula written at the second order for Brownian motion reads as:

Chain rule: $d(f(B(t)))=f'(B(t)) d B(t) + {1\over 2}f”(B(t)) d t$, for $f$ with continuous 2nd derivative:

$d \exp(B(t)) = \exp(B(t)) d B(t) + {1\over 2} \exp(B(t)) d t$

$d B^2(t) = 2 B(t) d B(t) + dt$

Let X be an Ito process with stochastic differential then

$d X(t) = \mu(t) d t + \sigma(t) d B(t)$ for 0 ≤ t ≤ T.

Let ($f\in C^2$) then the stochastic differential of the process f exists and can be written as

$$\begin{eqnarray*}

df(X(t)) &=& f'(X(t))dX(t)+\frac{1}{2}f”(X(t))d[X,X](t)\\

&=& f'(X(t))dX(t)+\frac{1}{2}f”(X(t))\sigma^{2}(t)dt\\

&=& \left(f'(X(t))\mu(t)+\frac{1}{2}f”(X(t))\sigma^{2}(t)\right)dt+f'(X(t))\sigma(t)dB(t)

\end{eqnarray*}$$

If x is a non- random process satisfying $\int_0^T X^2(s) d s <\infty$ and Y is a Gaussian process where $Y(t)=\int_0^t X(s) d B(s)$ Then it hass a zero mean and the covariance can be written as below:

$\mbox{Cov}(Y(t), Y(t+u)) = \int_0^t X^2(s) d s$, $t,u\ge 0$
]]>
__The Theory of Speculation__

The focus of Bachelier’s work (The Theory of Speculation) was to establish the probability law for the price fluctuations that the market admits at this instant. Indeed, while the market does not foresee fluctuations, it considers which of them are more or less probable, and this probability can be evaluated mathematically.

There are innumerable influences on the movements of stock exchange.

Stock Exchange acts upon itself, implies: its current movement is a function not only of earlier fluctuations, but also of the present market position.

The determination of these fluctuations is subject to an infinite number of factors: it is therefore impossible to expect a mathematically exact forecast.

*(An excellent translation of the original paper: http://www.radio.goldseek.com/bachelier-thesis-theory-of-speculation-en.pdf)*

__ __

__A few important terms defined__

**Spot Price:** The current price at which a particular security can be bought or sold at a specified time and place. (A securities futures price is the expected value of the security.)

**Futures Contract:** A contractual agreement to buy or sell a particular commodity or financial instrument at a pre-determined price in the future.

**Option:** offers the buyer the right, but not the obligation, to buy (call) or sell (put) a security or other financial asset at an agreed-upon price (the strike price) on a specific date (exercise date).

**Mathematical advantage**: Mathematical expectation does not provide a coefficient representing, in some sense, the intrinsic value of the game, it only logically tells whether a game will be profitable or not, Hence the concept of Mathematical advantage.

We can understand mathematical advantage from the example of a gambler’s odds as explained in the paper by Louis Bachelier: Define the mathematical advantage of a gambler as the ratio of his positive expectation and the arithmetic sum of his positive and negative expectations. Mathematical advantage varies like probability from zero to one, it is equal to 1/2 when the game is fair.

Mathematical expectation of gain / (mathematical expectation of gain + absolution value of mathematical expectation of loss).

The mathematical expectation of the speculator is zero

__The General Form of probability curve__

The probability that a price y be quoted at a given epoch is a function of y.

The price considered by the market as the most likely is the current true price.

Prices can vary between −x_{o} and +∞: x_{o} being the current absolute price.

It will be assumed that it can vary between −∞ and +∞. The probability of a spread greater than x_{o} being considered a priori entirely negligible.

The probability of a deviation from the true price is independent of the absolute level of this price, and that the probability curve is symmetrical with respect to the true price.

Only relative prices will matter because the origin of the coordinates will always correspond to the current true price.

__Probability law of Financial Product__

**Method 1:**

- p
_{x,t}dx: probability for price to be in the interval [x,x+dx) at time t

p_{x,t}dx denotes the probability that, at epoch t, the price is to be found in the elementary interval x, x + dx.

The desired probability is therefore

\[

p_{x,t_1}p_{z-x,t_2} dx dz

\]

- Markov property: p
_{z,t1+t2}dz =_{−∞}∫^{∞}p_{x,t1}p_{z−x,t2}dxdz

Because at epoch t_{1}, the price could be located in any the intervals dx between −∞ and +∞, so this is the probability of the price z being quoted at epoch t_{1}+ t_{2}. - Homogeneous Gaussian process:

\[

p_{x,t} = {1\over 2\pi k\sqrt t}\exp\left(-{1\over 4\pi k^2 t}\right)

\]

Where \[k=\int_0^\infty p_{x,1} dx\] is the gain at time 1.

**Method 2: Alternate derivation of the Probability Law**

- conduct m Bernoulli trials in [0,t), with success rate p

Suppose that two complementary events A and B have the respective probabilities p and q = 1 – p. The probability that, on m occasions, it would produce α equal to A and m − α equal to B, is given by the expression.

\[{P}(\alpha; m,p) = {m\choose \alpha}p^\alpha q^{m-\alpha}\]

- mathematical expectation of gain

\[\sum_{h\ge0}\mbox{P}(mp+h;m,p)\]

\[ = pq\cdot\left(\partial_{p}\sum_{h\ge0}\mbox{P}(mp+h;m,p)-\partial_{q}\sum_{h\ge0}\mbox{P}(mp+h;m,p)\right) \]

\[= \frac{m!}{(mp)!(mq)!}p^{mp}q^{mq}mpq\]

The quantity h is called the spread.

Let us seek for the mathematical expectation for a gambler who would receive a sum equal to the spread whenever this spread be positive.

We have just seen that the probability of a spread h is the term from the expansion of (p + q)^{m} in which the exponent of p is mp + h, and that of q is mqh.

To obtain the mathematical expectation corresponding to this term, it is necessary to multiply this probability by h. Now,

h = q(mp + h) − p(mqh)

mp + h and mqh being the exponents of p and q in a term from (p + q)^{m}.

- Apply Stirling’s formula

\[n!=e^{-n}n^n\sqrt{2\pi n}\]

\[{E}={dh\over \sqrt{2\pi m p q}}\exp\left(-{h^2\over 2 m p q}\right)\]

]]>**Rank-Size Population Distributions And Dynamics of Cities
**

We analyse how the shift in population of a given city happens with time given their ranks. We try to rank them by their population size.\\\\

We define some parameters to represent the population and their ranks. \\

$P_{1}(t),…..,P_{N}(t)$: populations of N cities at time t

$r_{1},….,r_{N}$: population ranks of cities in decreasing

$P_{i,r_{i}=1}≥P_{j,r_{j}=2}≥P_{m,r_{m}=N}$: population sorted in decreasing order

$\Gamma(t+1)=\sum_{i}|p_{i,r_{i}(t+1)}(t+1) – p_{i,r_{i}(t)}(t)|$: percentage shift in population, where $pi,r=P_{i,r}/\sum_{i}P_{i,r}$

$\wedge(t+1)=\sum_{i}|r_{i}(t+1)-r_{i}(t)|/N$ percentage shift in rank order

insert image 1

insert image 2

We plotted the $\log(rank)$ vs $\log(population)$ for 458 standardized administrative areas from 1901 to 1991. We can see that it shows a power law distribution between rank and population. The curve is very stable as we can see from the plot. As 1901 and 1901 the curve is similar with same coefficient even with population of the cities changes.\\

How can we explain this above pattern based on microscopic agent behaviour and interaction. If we can explain then we can define the model for population development:\\

insert image 3

We will try to explain one of the multiple interpretation.

let increment in population is some random variable with Gaussian distribution.\\

If we have the initial state (population at time t = 0, $P_{i_{0}}$) and increment of population at time t = 0 is $\epsilon_{0}$ and for city 1 increment is $\epsilon_{1}$ and so on.

$\log P_{i}(t) = \log P_{i}(0) + \sum_{\tau=0}^t \epsilon_{i}(\tau)$

Also assume mean is $\mu$ and variance is $\sigma^2$, than we get the log normal distribution. (meaning logarithm of the population is normal distribution)

Now simulate the same number of cities, 458 cities and we start from a uniform population distribution. In each city we simulate the population growth according to Gaussian distribution. We sample the scaling factor of all of the 458 cities and we simulate from 991 AD to 1991 and what we find out is indeed that we have some kind of nice straight line of the log population shear versus log ranks. It is one feasible interpretation, hypothesis about city development.

**Explaining city growth with Ants Model**

How can we explain the above model (growth in population of cities)?

insert image 4

We use this agents based model to explain the growth of cities.

In the picture we have nest in the center, and three food sources. In starting ants move around randomly until they reach the food source. They carry the food and leave the chemical trails while returning for the other ants. So that they can find the food source easily. We can see the simulation of this model in the NetLogo tool.\\

We can identify many things with this model. First is that nearest food resource will be probabilistically explored first and furthest will be explored last.

Second, whenever ant find it and they will leave trails. And It will have large concentration of the chemicals. So we can relate this ant model with the behaviour of the people.

**Formulation of Ants Model**

Following random variables represent the Ants model:

Position of any ant at time t : $A(x_i(t),y_i(t))$

Home of ant i: $A(X_i,Y_i)$

Position of resource k : $R(X_k, Y_k)$

The chemical concentration of scent at point (x,y) : $\psi(x,y)$

There are few things to note in the above representation. ‘Home of ant’ does not vary with time. The chemical concentration of scent is a function of time.

**Process**

When the ant is moving back to home with food, it leaves the trail of chemical scent. This is represented by the following equation:

\begin{equation}

\psi(x_i(t+1),y_i(t+1)) = \psi(x_i(t),y_i(t)) + \phi

\end{equation}

This equation says that concentration of chemical scent depends on the same at the postion in previous time step, with some added noise $\phi$.

Generally, we can represent the ants motion as the following:

\begin{equation}

x_i(t+1) = x_i(t) + \Delta x_i(t) + \epsilon_x(t)

\end{equation}

\begin{equation}

y_i(t+1) = y_i(t) + \Delta y_i(t) + \epsilon_y(t)

\end{equation}

That is, the position of ant depends on the gradient of previous time step plus some added noise, $\epsilon$.

The gradient, $\Delta x$ or $\Delta y$ depends on the two cases:

When the ant is seeking food: We know that the ant is trying at random (given by $\epsilon(t)$, but following any chemical trail if it finds. This is intuitively represented by the equation:

\begin{equation}

(\Delta x_i(t), \Delta y_i(t)) = c \cdot \nabla \psi(x_i(t),y_i(t)) + \epsilon(t)

\end{equation}

When the ant is moving food back: We know that ant follows the path that is dependent on the ant’s home. This is given by the equation:

\begin{equation}

(\Delta x_i(t), \Delta y_i(t)) = (f(X_i),f(Y_i))

\end{equation}

This formulation let us think the ant model probabilistically.

**Simulation and Observation of Ants Model**

First we simulate a \textbf{simplified ants model, with one home (at origin) and one food center} (represented by a square box). In the image below we can observe three time stages. Intitially (at t=0), we observe that the ants move in random directions. There are no definite permanent tracks yet.\par

As we skip to time, t=200, we observe that ants have found the food and the track has started to emerge. But the ants are still randomly distributed in the space.\par

Finally, at time, t=2000, we observe that the track is well defined and the ants have started to cluster around this track.\linebreak

Insert image 5

Next we simulate an \textbf{ant model with random homes and multiple resources}. In the below image, we can clearly observe one of the main properties of cities, that is roads. We can observe star like structures (roads) between homes and resources after sufficient number of iterations.

Insert image 6

Next let us make this formulation more interesting and complex. We add following features instead of resources: ‘Industry’ ($I_{xy}$) and ‘Service’ ($S_{xy}$). We also dynamically add population ($P_{xy}$) to the model. ‘Service’ is dependent on ‘Industry’. The population is dependent of ‘Service’ and ‘Industry’. We can model any kind of interaction between Industry, Service and Population.

The process of the model is defined by:

The Position at any time, depends on previous time step population, the diffusion of population ($\nabla^2P_{xy}(t)$) (laplacian) at previous time step, the interaction of industry and service ($o_{xy}^{(k)}(t)$) and random noise ($\epsilon$).

\begin{equation}

P_{xy}(t+1) = P_{xy}(t) + \theta^{(P)}\nabla^2P_{xy}(t) + \Sigma_k o_{xy}^{(k)}(t) + \epsilon_{xy}^{(P)}(t)

\end{equation}

where, $\theta$ is diffusion constant

Similarly, we can represent the denisty of Indurtrial resouce and Service:

\begin{equation}

I_{xy}(t+1) = I_{xy}(t) + \theta^{(I)}\nabla^2 I_{xy}(t) + \Sigma_k o_{xy}^{(k)}(t) + \epsilon_{xy}^{(I)}(t)

\end{equation}

\begin{equation}

S_{xy}(t+1) = S_{xy}(t) + \theta^{(S)}\nabla^2 S_{xy}(t) + \Sigma_k o_{xy}^{(k)}(t) + \epsilon_{xy}^{(S)}(t)

\end{equation}

**Observations**

In the image below we can observe the simulation. Intially, we see the formulation of nuclear cities (at t=100). We observe cluster of population which represent cities. The log-size of the ‘cities’ is linear with log-rank, which we had seen in the british cities data from 1901 to 1991. We also observe the formation of roads.

Insert image 7

**SandPile model**

Let us perform the following experiment:

Drop sand uniformally onto a table. Gradually, sandpile height increases, and the sand starts to fall in surrounding areas. The key observation is that sand fall can trigger more sand fall and an ‘avalanche’ will occur.

In Sandpile Model we try to model the aboce experiment. Model Representation:

\begin{equation}

X_{i,j} \geq 4: X_{i,j} = X_{i,j} – 4,

\end{equation}

\begin{equation}

X_{i \pm 1,j} = X_{i \pm 1,j} + 1,

\end{equation}

\begin{equation}

X_{i,j \pm 1} = X_{i,j \pm 1} + 1,

\end{equation}

where $X_{i,j}$ is the height of the sand pile.

NetLogo Demo:

Checkout:

http://netlogoweb.org/launch#http://netlogoweb.org/assets/modelslib/Sample%20Models/Chemistry%20&%20Physics/Sandpile.nlogo

You can play with it, and observe that small avalanch occur more often than large ones. We study Sandpile model as it is related to many real-life scenarios. For instance, the migration of the population from one city to another can be said to follow sand pile model. Similarly, we can observe it in forest fires, frequency and size of earthquakes, etc.

**Size of Avalanche based on SandPile model**

Here we are using the SandPile model to understand the working of avalanches.

Attempts to model avalanche behaviour date from the early 20th century, notably the work of Professor Lagotala in preparation for the 1924 Winter Olympics in Chamonix. His method was developed by A. Voellmy and popularised following the publication in 1955 of his Ueber die Zerstoerungskraft von Lawinen (On the Destructive Force of Avalanches). Voellmy used a simple empirical formula, treating an avalanche as a sliding block of snow moving with a drag force that was proportional to the square of the speed of its flow:

\begin{equation}

Force = 1/2 * \rho \nu^2

\end{equation}

Avalanche increases in the following manner having a relation to the previous state

\linebreak

\begin{equation}

A + \phi \rightarrow B

\end{equation}

\begin{equation}

B + \phi \rightarrow C

\end{equation}

\begin{equation}

C + \phi \rightarrow D

\end{equation}

\begin{equation}

D + \phi \rightarrow p_1 D + \phi, p_2 C + 2\phi, B + 3\phi, A + 4\phi

\end{equation}

The kinetic equations for this growth are:

\begin{equation}

\dot{n_A} = n_\phi (p_4 n_D – n_A)

\end{equation}

\begin{equation}

\dot{n_B} = n_\phi (p_3 n_D + n_A – n_B)

\end{equation}

\begin{equation}

\dot{n_C} = n_\phi (p_2 n_D + n_B – n_C)

\end{equation}

\begin{equation}

\dot{n_D} = n_\phi (p_1 n_D + n_C – n_D)

\end{equation}

In general,

\begin{equation}

\dot{n_\phi} = n_\phi (\bar{p} n_D – 1) + \bar{p}\nu\nabla^2 (n_\phi n_D) + \eta(r,t)

\end{equation}

where

\bar{p} = p_1 + 2p_2 + 3p_3 + 4p_4

\nu = diffusion coefficient

\eta = noise

Insert image 8

Avalanches in 2D sandpile model.The dark regions show the size

of avalanches triggered by a single addition.

Though at first glance, it seems that the number of topplings and the area both measure the number of affected sites, a single site may topple more than once in a single avalanche, and hence the two are truly different variables. The duration of the avalanche is defined as the number of updates one must perform on the system before all the sites become stable after the addition of one grain of sand.

Video An avalanche triggered which follows SandPile model

**Avalanche Equilibrium**

For the avalanche to reach its equilibrium we have the probability of the number of grains affected after t steps

Probability generating function:

\begin{equation}

G_t(\lambda_A,\lambda_B,\lambda_C,\lambda_D) = \sum_{N} (N_A,N_B,N_C,N_D) \lambda^{N_A}_A \lambda^{N_B}_B \lambda^{N_B}_B \lambda^{N_D}_D

\end{equation}

At time 1,

\begin{equation}

G_1 = (n_Dp_4)\lambda_A + (n_Dp_3 + n_A)\lambda_B + (n_Dp_2 + n_B)\lambda_C + (n_Dp_1 + n_C)\lambda_D

\end{equation} \linebreak

At time t+1, the recursive relationship with $G_t is

\begin{equation}

G_{t+1} = (n_A\lambda_B + n_B\lambda_C + n_C\lambda_D) (n_Dp_4)\lambda_A + n_D(p_1\lambda_DG_t + p_2\lambda_DG_t^2 + p_3\lambda_BG_t^3 + p_4\lambda_AG_t^4)

\end{equation}

\begin{equation}

G_\infty(\lambda) = \frac{1-\sqrt{a-4\lambda^2n_D(1-n_D)}}{2\lambda n_D}

\end{equation}

Example: Metroploitan Buffalo population growth (1750-1990)

Insert image 9

It has been discusseed and found that when a system diverges the various parameters of the Sandpile model exhibit a power law distribution for scenarios like avalanches or population. For additional information see this http://guava.physics.uiuc.edu/~nigel/courses/563/Essays_2012/PDF/banerjee.pdf

For such a distribution in case of population, we have the following parameters:

\begin{equation}

N(R) = (2R)^D,

\end{equation}

where $\textit{D}$ = fractal dimension of the city 1\leq\textit{D}\leq2

R = effective radius of city

Cumulative populaiton density:

\begin{equation}

P(R) = N(R)/A(R) = (2R)^{D-2}

\end{equation}

where A(R) = Area of square

Incremental Populaion

\begin{equation}

n(R) = \frac{dN(R)}{dR} = D2R^{D-1}

\end{equation}

Incremental density

\begin{equation}

\rho(R) = \frac{dn(R)}{dR} = D^2R^{D-2}

\end{equation}