### Kalman Smoothing and Parameter Estimation

We work backward here.

\begin{align*}

& P(x_{k}|y_{1,\dots,T})\\

= & \sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,T})P(x_{k+1}|y_{1,\dots,T})\\

= & \overset{P(x_{k}|x_{k+1}y_{1,\dots,T})=\frac{P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}{\sum_{x_{k}}P(y_{1,\dots k},x_{k})P(x_{k+1}|x_{k})P(y_{k+1,\dots,T}|x_{k+1})}=P(x_{k}|x_{k+1}y_{1,\dots,k})}{\sum_{x_{k+1}}P(x_{k}|x_{k+1}y_{1,\dots,k})P(x_{k+1}|y_{1,\dots,T})}\\

= & \sum_{x_{k+1}}\frac{P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}{\sum_{x_{k}}P(x_{k},y_{1,\dots,k})P(x_{k+1}|x_{k})}\cdot P(x_{k+1}|y_{1,\dots,T})

\end{align*}

Hence to estimate the distribution of $x_{k}$ from $y_{1,\dots,T}$ in Kalman smoothing, we first need to find the joint distribution $P(x_{k+1}|x_{k})P(x_{k}|y_{1,\dots,k})$ just after the prediction step to estimate the distribution of $x_{k+1}$ in Kalman filtering, condition the distribution of $x_{k}$ from predicted value of $x_{k+1}$, join with the distribution of $x_{k+1}$ in Kalman smoothing, and marginalize over the distribution of $x_{k+1}$ in Kalman smoothing:

\begin{align*}

\left(\begin{matrix}x_{k|k}\\

x_{k+1|k}

\end{matrix}\right) & \sim\mathcal{N}\left(\left(\begin{matrix}\hat{x}_{k|k}\\

F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}

\end{matrix}\right),\left(\begin{matrix}P_{k|k} & P_{k|k}F_{k+1}^{T}\\

F_{k+1}P_{k|k} & F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}

\end{matrix}\right)\right)\\

X_{k|k}|x_{k+1|k} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\

& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)\\

X_{k|k} & =\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(x_{k+1|k}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right)+\epsilon\\

& \mbox{where }\epsilon\sim\mathcal{N}(0,P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k})\\

X_{k|T} & \sim\mathcal{N}\left(\hat{x}_{k|k}+P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}\left(\hat{x}_{k+1|T}-F_{k+1}\hat{x}_{k|k}+B_{t+1}u_{t+1}\right),\right.\\

& \phantom{\mathcal{\sim N\bigg(}}\left.P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}P_{k|T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}+P_{k|k}-P_{k|k}F_{k+1}^{T}\left(F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}\right)^{-1}F_{k+1}P_{k|k}\right)

\end{align*}

Hence

- innovation in smoothing is $\delta=\left(\hat{x}_{k+1|T} -\hat{x}_{k+1|k}\right)$
- Covariance of innovation is $P_{k+1|k}=F_{k+1}P_{k|k}F_{k+1}^{T}+Q_{k+1}$
- Gain in smoothing is $G=P_{k|k}F_{k+1}^{T}P_{k+1|k}^{-1}$
- Updated (a posterior) state estimate $\hat{x}_{k|T}=\hat{x}_{k|k}+G\delta$
- Updated (a posterior) covariance matrix $P_{k|T}=P_{k|k}+G(P_{k+1|T}-P_{k+1|k})G^T$

### Parameter Learning

We will estimate $F$, $H$, $Q$ and $R$ from observations $z_1,\dots z_T$ about the following system

- $X_t = F X_{t-1} + w_t$, where $w_k\sim{\cal N}(0, Q)$ are i.i.d.
- $Z_t = H X_t + v_t$, where $v_k\sim{\cal N}(0, R)$ are i.i.d. and independent of $w_{t’}$ for all $t’$

Likelihood of $X_{1,\dots,T},Z_{1,\dots,T}$

\begin{align}

P(X_{1,\dots,T},Z_{1,\dots,T})

=&P(X_{1})\prod_{t=2}^{T}P(X_{t}|X_{t-1})\prod_{t=1}^{T}P(Z_{t}|X_{t})\\

=&\left(2\pi\right)^{-\frac{d_X}{2}}\left|\Sigma_{0}\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)\right)\\

&\prod_{t=2}^{T}\left(2\pi\right)^{-\frac{d_X}{2}}\left|Q\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)\right)\cdot\\

&\prod_{t=1}^{T}\left(2\pi\right)^{-\frac{d_Z}{2}}\left|R\right|^{-\frac{1}{2}}\exp\left(-{1\over 2}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)\right)

\end{align}

Expected log likelihood observing $Z_{1,\dots,T}$

\begin{align}

\mathbb{E}_{X_{1,\dots,T}}\log P(X_{1,\dots,T},Z_{1,\dots,T})

=& {\rm constant} {-\frac{1}{2}}\log\left|\Sigma_{0}\right|-{1\over 2}\mathbb{E}\left(X_{1}-\mu_{0}\right)^{\mbox{T}}\Sigma_{0}^{-1}\left(X_{1}-\mu_{0}\right)+\\

&\sum_{t=2}^{T}{-\frac{1}{2}}\log\left|Q\right|-{1\over 2}\mathbb{E}\left(X_{t}-F\cdot X_{t-1}\right)^{\mbox{T}}Q^{-1}\left(X_{t}-F\cdot X_{t-1}\right)+\\

&\sum_{t=1}^{T}{-\frac{1}{2}}\log\left|R\right|-{1\over 2}\mathbb{E}\left(Z_{t}-H\cdot X_{t}\right)^{\mbox{T}}R^{-1}\left(Z_{t}-H\cdot X_{t}\right)

\end{align}

Estimating state transition model $F$ and observation model $H$

\begin{align}

\frac{\partial}{\partial F_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&Q^{-\mbox{T}}\sum\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}=Q^{-\mbox{T}}F\sum\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}\\

\Rightarrow&F=\left(\sum_{t=2}^{T}\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}\right)\left(\sum_{t=2}^{T}\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}\right)^{-1}\\

\frac{\partial}{\partial H_{ij}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0\Rightarrow&H=\left(\sum_{t=1}^{T}Z_{t}\mathbb{E}X_{t}^{\mbox{T}}\right)\left(\sum_{t=1}^{T}\mathbb{E}X_{t}X_{t}^{\mbox{T}}\right)^{-1}

\end{align}

Solution: write down an equation involving

$\left(\matrix{\frac{\partial\log P}{\partial F_{11}} & \dots & \frac{\partial\log P}{\partial F_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial\log P}{\partial F_{d1}} & \dots & \frac{\partial\log P}{\partial F_{dd}} }\right)$

and solve $F$, where $d$ is the dimensionality of $X$. Apply the same procedure to solve $H$.

Estimating variance of process noise $Q$ and observation noise $R$

\begin{align}

\frac{\partial}{\partial Q_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&\left(T-1\right)Q-\sum_{t=2}^{T}\left(\mathbb{E}X_{t}X_{t}^{\mbox{T}}-F\mathbb{E}X_{t-1}X_{t}^{\mbox{T}}-\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}F^{\mbox{T}}+F\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}F^{\mbox{T}}\right)\stackrel{\mbox{set}}{=}0\\

\Rightarrow&Q=\frac{1}{T-1}\sum_{t=2}^{T}\left(\mathbb{E}X_{t}X_{t}^{\mbox{T}}-F\mathbb{E}X_{t-1}X_{t}^{\mbox{T}}-\mathbb{E}X_{t}X_{t-1}^{\mbox{T}}F^{\mbox{T}}+F\mathbb{E}X_{t-1}X_{t-1}^{\mbox{T}}F^{\mbox{T}}\right)\\

\frac{\partial}{\partial R_{ij}^{-1}}\mathbb{E}\log P\stackrel{\mbox{set}}{=}0 \Rightarrow&R=\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}Z_{t}Z_{t}^{\mbox{T}}-H\mathbb{E}X_{t}Z_{t}^{\mbox{T}}-Z_{t}\mathbb{E}X_{t}^{\mbox{T}}H^{\mbox{T}}+H\mathbb{E}X_{t}X_{t}^{\mbox{T}}H^{\mbox{T}}\right)

\end{align}

Solution: Note that the determinant of a matrix is $|A| = \sum_{\sigma\in S_n}{\rm sgn}(\sigma)\prod_{k=1}^d A_{k\sigma_k}$. Hence $\left(\matrix{ \frac{\partial|A|}{\partial A_{11}} & \dots & \frac{\partial|A|}{\partial A_{1d}}\cr \vdots & \ddots & \vdots\cr \frac{\partial|A|}{\partial A_{d1}} & \dots & \frac{\partial|A|}{\partial A_{dd}}}\right)\cdot A^{\mbox{T}}=|A|\cdot I$ and $\frac{\partial\log|A|}{\partial A}=\frac{|A|}{|A|}A^{-T}$.

### Schur complement

Our goal is to find inverse of the covariance matrix $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)$. In order to do that, we first use the 2nd column $\left(\matrix{\Sigma_{12}\cr\Sigma_{22}}\right)$ to Gaussian-eliminate $\Sigma_{21}$, which is equivalent to right multiply the covariance matrix with $\left(\matrix{I & 0 \cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)$. We then use the resulting 2nd row $\left(\matrix{0 & \Sigma_{22}}\right)$ to eliminate $\Sigma_{12}$, which is equivalent to left multiply the covariance matrix with $\left(\matrix{ I & \Sigma_{12}\Sigma_{22}^{-1} \cr 0 & I}\right)$. We performed the following operation.

$$

\begin{align*}

& \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)

\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)\\

= & \left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr & I}\right)

\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12}\cr 0 & \Sigma_{22}}\right)\\

= & \left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right).

\end{align*}

$$

If we move $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)$ and $\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)$ to the right hand side of the equation above, then $\left(\matrix{\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0\cr 0 & \Sigma_{22}}\right)$ and the resulting $\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)^{-1}$ to the left hand side of the equation, we get the following.

$$

\begin{align*}

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr\Sigma_{21} & \Sigma_{22}}\right)^{-1}

= & \left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)

\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)

\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

\end{align*}

$$

**Application of Schur complement to find the conditional probability density function of a multivariate normal distribution**. Let $x = \left(\matrix{X_1\cr X_2}\right)\sim \mathcal{N}\left(\left(\matrix{\mu_1\cr\mu_2}\right), \left(\matrix{\Sigma_{11} &\Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right) \right)$ be multivariate normal distribution, where $X_1$ and $X_2$ are vectors of dimension $p$ and $q$ respectively, and we want to find the conditional distribution of $X_1$ on $X_2=x_2$.

The probability density function of $X$ is

\begin{align*}

& f\left(\matrix{x_{1}\cr x_{2}}\right)\\

\propto & \exp\left(-\frac{1}{2}

\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)

\left(\matrix{\Sigma_{11} & \Sigma_{12}\cr \Sigma_{21} & \Sigma_{22}}\right)^{-1}

\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\

= & \exp\left(-\frac{1}{2}

\left(\matrix{\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)^{T}}\right)

\left(\matrix{I & 0\cr -\Sigma_{22}^{-1}\Sigma_{21} & I}\right)

\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)

\left(\matrix{I & -\Sigma_{12}\Sigma_{22}^{-1}\cr 0 & I}\right)

\left(\matrix{x_{1}-\mu_{1}\cr x_{2}-\mu_{2}}\right)\right)\\

=& \exp\left(\left(\begin{matrix}\left(x_{1}-\mu_{1}\right)^{T}-\left(x_{2}-\mu_{2}\right)^{T}\Sigma_{22}^{-1}\Sigma_{21} & \left(x_{2}-\mu_{2}\right)^{T}\end{matrix}\right)

\left(\matrix{\left(\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0\cr 0 & \Sigma_{22}^{-1}}\right)

\left(\matrix{x_{1}-\mu_{1}-\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right)\cr x_{2}-\mu_{2}}\right)\right).

\end{align*}

Hence $X_{1}|X_{2}=x_{2}\sim\mathcal{N}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$.

### Duality between maximum entropy / minimum divergence and maximum likelihood

1. Convexity of $f^∗(y)=\sup_x(y^\mbox{T}⋅x−f(x))$: We want to show that $f^*(\alpha_1*y_1+\alpha_2*y_2)\le \alpha_1*f^*(y_1)+\alpha_2*f^*(y_2)$ for any $0\le\alpha_1,\alpha_2\le 1$ with $\alpha_1+\alpha_2=1$, $y_1$ and $y_2$. In fact we have

$$\begin{eqnarray*}

f^{*}(\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2}) & = & \sup_{x}((\alpha_{1}\cdot y_{1}+\alpha_{2}\cdot y_{2})^{\mbox{T}}x-f(x))\\

& = & \sup_{x}(\alpha_{1}(y_{1}^{\mbox{T}}x-f(x))+\alpha_{2}(y_{2}^{\mbox{T}}x-f(x)))\\

& \le & \alpha_{1}\sup_{x_{1}}((y_{1}^{\mbox{T}}x-f(x)))+\alpha_{2}\sup_{x_{2}}((y_{2}^{\mbox{T}}x-f(x)))\\

& = & \alpha_{1}f^{*}(y_{1})+\alpha_{2}f^{*}(y_{2})\\

\end{eqnarray*}$$

2. If $f(x)$ is differentiable in $x$, the supremum $\sup_x y^\mbox{T}\cdot x – f(x)$ is attained at $x^*$ with ${\partial\over\partial x}=0$, i.e., $y-f'(x^*)=0$ or $y$ is the derivative of $f$ at point $x^*$. By definition of supremum, $y^\mbox{T}\cdot x-f(x)\le f^*(y)$. We call $(x^*, f'(x^*))$ Legendre conjugate pair.

3. Double Legendre-Fenchel dual $f^{**}(x)$ as the Legendre-Fenchel dual of $f^*(y)$ is convex as the result of 1. The double Legendre-Fenchel dual of a convex function is itself. Suppose $g(x)\ge f^{**}(x)$ is a convex function below $f$. Then $\max(f^{**},g)$ is a convex function below $f$. Hence double Legendre-Fenchel dual of a function is the maximal convex function below.

4. Let $x=(x_1,x_2,\cdots)$ be the mean parameters of categorical distribution. Then $\phi(x)=x^\mbox{T}\cdot\log x=(\log x)^\mbox{T} \cdot x$ is the entropy of this distribution, $\triangledown \phi(x) = 1+\log(x)$, and

$$\begin{eqnarray*}

d_{\phi}(y,x) & = & \phi(y)-\phi(x)-\triangledown\phi(x)\cdot(y-x)\\

& = & y^{\mbox{T}}\cdot\log y-x^{\mbox{T}}\cdot\log x-(1+\log x)^{\mbox{T}}(y-x)\\

& = & y^{\mbox{T}}\cdot\log y-y^{\mbox{T}}\cdot\log x\\

& = & d_{\mbox{KL}}(x,y)

\end{eqnarray*}$$

5. Since $\phi$ is convex, $\phi(\alpha\cdot x+(1-\alpha)\cdot y)\le\alpha\phi(x)+(1-\alpha)\phi(y)$ for $0\le \alpha\le 1$, or

$$\begin{eqnarray*}

\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{(1-\alpha)(\phi(y)-\phi(x))}{(1-\alpha)(y-x)}=\frac{\phi(y)-\phi(x)}{(y-x)},\\

\phi'(x)=\lim_{\alpha\to1}\frac{\phi(\alpha\cdot x+(1-\alpha)\cdot y)-\phi(x)}{(\alpha\cdot x+(1-\alpha)\cdot y)-x} & \le & \frac{\phi(y)-\phi(x)}{(y-x)},\\

d_{\phi}(y,x)=\phi(y)-\phi(x)-\phi'(x)\cdot(y-x) & \ge & 0.

\end{eqnarray*}$$

6. For $0\le \alpha_1,\alpha_2\le 1$ and $\alpha_1+\alpha_2=1$,

$$\begin{eqnarray*}

d_{\phi}(\alpha_{1}y_{1}+\alpha_{2}y_{2},x) & = & \phi(\alpha_{1}y_{1}+\alpha_{2}y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\

& \le & \alpha_{1}\phi(y_{1})+\alpha_{2}\phi(y_{2})-\phi(x)-\phi'(x)(\alpha_{1}y_{1}+\alpha_{2}y_{2}-x)\\

& = & \alpha_{1}\left(\phi(y_{1})-\phi(x)-\phi'(x)(y_{1}-x)\right)+\alpha_{2}\left(\left(\phi(y_{2})-\phi(x)-\phi'(x)(y_{2}-x)\right)\right)\\

& = & \alpha_{1}d_{\phi}(x,y_{1})+\alpha_{2}d_{\phi}(x,y_{2}).

\end{eqnarray*}$$

7.

$$\begin{eqnarray*}

d_{\phi_{1}+\phi_{2}}(x,y) & = & (\phi_{1}+\phi_{2})(y)-(\phi_{1}+\phi_{2})(x)-(\phi_{1}+\phi_{2})(x)\cdot(y-x)\\

& = & \cdots=d_{\phi_{1}}(x,y)+d_{\phi_{2}}(x,y)

\end{eqnarray*}$$

8. $\{x|d_\phi(x,u)=d_\phi(x,v)\}$ is a hyper plane: $d_{\phi}(x,u)=\phi(x)-\phi(u)+\phi'(u)(x-u)=\phi(x)-\phi(v)+\phi'(v)(x-v)=d_{\phi}(x,v)$. Rearranging the terms $\phi(u)-\phi(v)+\phi'(u)\cdot u-\phi'(v)\cdot v-(\phi'(u)-\phi'(v))\cdot x=0$.

9. We expand the expectationon $\mbox{E}_{X}d_{\phi}(X,u)=\mbox{E}\left(\phi(X)-\phi(u)-\phi'(u)(X-u)\right)=\mbox{E}\phi-\phi(u)-\phi'(u)(\mbox{E}X-u)$, and set its partial derivative over $u$ to be 0: $\frac{\partial}{\partial u}\mbox{E}_{X}d_{\phi}(X,u)=-\phi'(u)-\phi”(u)(\mbox{E}X-u)+\phi'(u)=0$. Since $\phi$ is convex, $\phi”\ge0$ and $u=\mbox{E}X$ minimizes expected K-L divergence.

10. $\psi(\theta)$ is convex-conjugate of $\phi(x)$, and $(\theta_1, x_1)$ and $(\theta_2,x_2)$ are conjugate pairs.

$$\begin{eqnarray*}

d_{\phi}(x_{2},x_{1}) & = & \phi(x_{2})-\phi(x_{1})-\phi'(x_{1})(x_{2}-x_{1})\\

& = & \theta_{2}^{\mbox{T}}\cdot x_{2}-\psi(\theta_{2})-\left(\theta_{1}^{\mbox{T}}\cdot x_{1}-\psi(\theta_{1})\right)-\theta_{1}(x_{2}-x_{1})\\

& = & \psi(\theta_{1})-\psi(\theta_{2})-x_{2}^{\mbox{T}}\left(\theta_{2}-\theta_{1}\right)\\

& = & d_{\psi}\left(\theta_{1},\theta_{2}\right).

\end{eqnarray*}$$

11

$$\begin{eqnarray*}

H(q)-\sum_{i}\alpha_{i}\left(E_{q}(f_{i})-E_{\tilde{p}}(f_{i})\right) & \leftarrow & \sum_{k}q_{k}\log q_{k}-\sum_{i}\alpha_{i}\left(\sum_{k}f_{i}(x_{k})q_{k}-E_{\tilde{p}}(f_{i})\right)\\

\frac{\partial}{\partial q_{k}} & = & 1+\log q_{k}-\sum_{i}\alpha_{i}f_{i}(x_{k})\stackrel{\mbox{set}}{=}0\\

\Rightarrow q_{k} & \propto & \exp\left(\sum_{i}\alpha_{i}f_{i}(x_{k})\right)\\

& & \mbox{take to the limit.}

\end{eqnarray*}$$

12. Given $q(x)=\exp(\theta\cdot g(x)-\Psi(\theta))$, $\log q(x)=\theta\cdot g(x)-\Psi(\theta)$. Setting partial derivative of log likelihood over parameter to 0: $\frac{\partial}{\partial\theta}\sum_{k}(\theta\cdot g(x^{(k)})-\Psi(\theta))=\sum_{k}g(x^{(k)})-N\cdot\frac{\partial}{\partial\theta}\Psi(\theta)\stackrel{\mbox{set}}{=}0$, we get $\mu=\frac{1}{N}\sum_{k=1}^{N}g(x^{(k)})$.

13. To find $\Psi^{*}(\mu)=\sup_{\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)$, we take the derivative of $\mu\cdot\theta-\Psi(\theta)$ over $\theta$: $$\frac{\partial}{\partial\theta}\left(\mu\cdot\theta-\Psi(\theta)\right)=\mu-\underbrace{\frac{\sum_{x}f(x)\cdot\exp(\theta\cdot f(x))}{\sum_{x}\exp\left(\theta\cdot f(x)\right)}}_{\triangledown_{\theta}\Psi=\mbox{E}(f(x))=\mu}\stackrel{\mbox{set}}{=}0.$$

Hence $\mu=\triangledown_{\theta}\Psi$, and $$\Psi^{*}(\mu)=\theta\cdot\triangledown_{\theta}\Psi-\Psi(\theta)=\mbox{E}(\frac{\theta\cdot f(x)}{\Psi(\theta)})=\sum_{x}p(x;\theta)\log p(x;\theta)=-H(\theta).$$

### Sample from finite state continuous time Markov process

1. There are 2 events in this system: weather change from sunny to rainy, and from rainy to sunny. Let the rate of the first event be $\alpha$ per day, and the rate of the latter event be $\beta$ per day. The number of events in $\Delta t$ days will be $\alpha\cdot\Delta t$ and $\beta\cdot\Delta t$ for $\Delta t$ small enough. This system is an inhomogeneous Poisson process.

Using first order approximation, the state transition kernel of one-day and one-hour is the following where $\Delta t$ takes value 1 and $1\over 24$ respectively,

$$\begin{matrix}\mbox{sunny}\\

\mbox{raining}

\end{matrix}\left(\begin{matrix}1-\alpha & \alpha\\

\beta & 1-\beta

\end{matrix}\right)\cdot\Delta t.$$

Refer lecture slides of 09/01 for sampling in R.

2. Refer lecture slides of 09/01 for EM in R.

3. If we do not have observations, we take $P(Y_t|X_t)=1$. In other words, we do not update latent state distribution.

4. Refer stochastic kinetic model algorithm of 10/01/2015.