Exponential Family, divergence and convex optimization [09/24/2015]

prepared by Hao Fan, Manu Kothegal Shivakumar, and Pritika Mehta.

Maximum entropy principle

Entropy is a measure of unpredictability of information content.  The Shannon entropy $Η$ of a discrete random variable $X$ with possible values $\{x_1, \cdots, x_n\}$ and probability mass function $P(X)$ is defined as:


Here $E$ is the expected value operator, and $I$ is the information content of $X$. $I(X)$ is itself a random variable. (Wikipedia, Entropy (information theory), 2015)

In a discrete case, the entropy is the sum over the logarithm of the probability mass function.  Here the weight is set to be the probability for us to get a specific value. In other words a Shannon entropy is the expected value of the log density or the log probability of mass and also take the minus sign, it can explicitly be written as


When the distribution is continuous rather than discrete, the sum is replaced with an integral as


where $P(x)$ represents a probability density function.

It can also be described as

$H(X) = \int_{-\infty}^{\infty}f(x)logf(x)dx$ if $X ∈ R$ and $\frac{dP(x)}{d\mu(x)} = f(x)$

where $\mu$ is Lebesgue measure.

The concept of maximum entropy principle is that, subject to precisely stated prior, the probability distribution which best represents the current state of knowledge is the one with largest entropy.

Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. Of those, the one with maximal information entropy is the proper distribution, according to the Principle of maximum entropy.

One question is why do we maximize the entropy, or why the probability distribution with the maximum entropy is the distribution with the least amount of information? The following example can help explain this.

If we throw a dice, a dice has six faces and the distribution among six faces with the maximum entropy is a uniform distribution. The probability distribution with minimum entropy as the distribution that we get some number one or two or three and so on and so forth. With a distribution, with the minimum, with zero entropy meaning that we know the result of the throw. In that way we do not have to guess and we have a distribution with zero entropy, meaning zero uncertainty there. Uniform distribution of all six faces as a distribution with maximum entropy meaning that, the expected value of the logarithm of the probability mass function is maximized and negative is maximized. So we can see by maximizing entropy, we minimizes prior information built into the distribution. And also many physical systems tend to move towards maximal entropy configurations over time.

Maximum entropy is constantly related to a family of distributions which is called the exponential distribution. According to the theory of exponential family, the distribution which maximizes entropy subject to constraints on moments lies in the exponential family with sufficient statistics corresponding to these moments.

We have already talked about many distributions such as geometric distribution, exponential distribution, normal distribution, uniform distribution.  Their relations with maximal entropy is as below.

  • Geometric distribution: maximizes entropy given mean on natural numbers
  • Exponential distribution: maximizes entropy given mean on positive real numbers
  • Normal distribution: maximizes entropy given mean and stand deviation
  • Uniform distribution: maximizes entropy on the support

A generalization of these distributions is is the Boltzmann distribution, which is from the statistic physics. This distribution has an invariance and is called n statistics or n features. N statistic is something that we can extract from our sample like we sum our samples together or we multiply our samples together or we take the maximum value or we take the minimum value.

Here the invariance is the expected value of those features and here we use f to denote the features and there are N features. In this distribution, the average value of our features is given and we want to maximize the entropy of this distribution with a constraint of the invariance. The format is as below.

Given n invariants $E(f_i(X)) = a_i$ for $i = 1,2,\cdots,n$

  • $p.m.f.p_k=c\cdot exp(\sum_{i=1}^n\lambda_i f_i(k))$ for $X = 1,2,\cdots$
  • $p.d.f.p_k=c \cdot (\sum_{i=1}^n\lambda_i f_i(k))$ for $X∈S∈R$
  • $c$ is partition function

Suppose that the support of this random variable meaning that the possible values of this random variable only takes a finite number of values from $1$, $2$, $3$, $4$, to some $K$. Then the probability mass function as we can show has some exponential form. We can look this as a dot product with those $\lambda$ and those parameters and those $f$ in a form of a dot product. If $X$ the value of this random variable is on the real numbers or on the high dimensional space of Euclidean space, then the probability density function is also the exponential form.

The proof is through the method of Lagrange multipliers.

There is

$J(P)=-\sum_{1}^k p_k logp_k=\mu(\sum_{1}^k p_k – 1)=\sum_{i=1}^n \lambda_i (\sum_{1}^k p_k f_i(k)-\alpha_n)$

Take derivative with respect to  and set it to 0, we have

$-1-logp_k+\mu+\sum_{i=1}^n\lambda_i f_i(k)=0$

The solution is

$p_k=\frac{exp(\sum_{i=1}^n\lambda_i f_i(k))}{exp(1-\mu)}=c\cdot exp(\sum_{i=1}^n\lambda_i f_i(k))$


$ c=\frac{1}{exp(1-\mu)}$

The next thing that we talk about is called Kullback-Leibler Divergence. Kullback–Leibler divergence (also known as information divergence, information gain, relative entropy, KLIC, or KL divergence) is a non-symmetric measure of the difference between two probability distributions $P$ and $Q$. Specifically, the Kullback–Leibler divergence of $Q$ from $P$, denoted , is a measure of the information lost when $Q$ is used to approximate P.

For category distribution. The KL divergence is as below

$D_{KL}(P || Q)=\sum_{i=1}^n P(i)log\frac{P(i)}{Q(i)}$

which is the probability of mass at a category I, probability of mass of P on category I over the probability of mass of Q on category I taking the logarithm and sum all of those quantities together with using the weight of probability P and those categories.

One thing we note is that if at some time $Q=0$ but $P\ne 0$ then we have some infinity. It’s something that we cannot use and similarly if P is 0 and $Q\ne 0$ at some category then we get something that is 0. For this KL divergence over category distributions to make sense we have to specify that the Q, whenever $P\ne 0$, Q shouldn’t be 0, meaning that if $Q=0$, P has to be 0. This is a condition that P is absolutely continuous with respect to Q.

For the continuous version suppose that the P and Q are supported by a real number then

$D_{KL}(P || Q)=\int dPlog\frac{dP}{dQ}$

To help us understand, we can consider as a distribution of, for example, on a real line we can have probability over intervals and on a two dimension space we can have probability over squares. Then over any interval we can find the probability of P over this interval. We can also find the probability Q over this interval. Then we divide the probability of P over this interval by the probability of Q over this interval.

Another thing we note is that if we compare this one with the entropy, we can find it is pretty similar to the entropy except that Q has uniform distribution on its support. So from there if we add or subtract some constant from the KL divergence of P over some uniform distribution of Q, we get to the negative entropy.



Wikipedia. (2015, 9 16). Entropy (information theory). Retrieved from Wikipedia, the free encyclopedia: https://en.wikipedia.org/wiki/Entropy_(information_theory)
Wikipedia. (2015, 8 27). Kullback–Leibler divergence. Retrieved from Wikipedia, the free encyclopedia: https://en.wikipedia.org/wiki/Kullback-Leibler_divergence
Wikipedia. (2015, 8 27). Principle of maximum entropy. Retrieved from Wikipedia, the free encyclopedia: https://en.wikipedia.org/wiki/Principle_of_maximum_entropy#Overview

Expectation Maximization from minimizing K-L Divergence

The Expectation Maximization algorithm can be used to solve the maximum likelihood parameter estimation problem for a model with hidden/latent variables. The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters, and iterates two steps:

  • Expectation step: the (missing) data are estimated given the observed data and current estimates of model parameters
  • Maximization step: The likelihood function is maximized under the assumption that the (missing) data are known

Let $Y$ denote the observable variables and let $X$ denote the latent variables. $Y = (Y_1,Y_2,\cdots,Y_n)$ where $Y_i=(y^{(1)}_i,y^{(2)}_i,y^{(3)}_i,\cdots,y^{(n)}_i)$ represent the observed values of $Y$. The probability model would be $p(y,x|\theta)$. If $x$ was also observed we could easily find out $\theta$ by maximizing $\log(y,x|\theta)$.

The log-likelihood of observing $y$ is $\log p(y|\theta) = \log \sum_x p(y,x|\theta)$. As $p(y,x|\theta)$ depends on $\theta$ and is unknown at the time when we estimate $\theta$, we average over $X$ using auxiliary distribution $q(x|y)$, and the expected log likelihood becomes $\sum_x q(x|y,\theta) \log p(y,z|\theta)$.

If $q$ is chosen carefully, the expected log likelihood won’t be far from the actual log likelihood and can serve as a effective surrogate. Even though it does not promise to yield a value of ${\theta}$ that maximizes the likelihood but it does promise improvement of ${\theta}$ with each iteration. This is the basic idea behind Expectation Maximization Algorithm.

We will begin the derivation of Expectation Maximization by showing that the auxiliary distribution can be used to provide the lower bound.


\log p(y|\theta) & = & \log\sum\limits _{x}\frac{q(x)\times p(y,x|\theta)}{q(x)}\\
& \ge & \mbox{E}_{q}[\log\frac{p(y,x|\theta)}{q(x)}]\mbox{ (Jensen’sinequality)}\\
& = & \mbox{E}_{q}[\log\frac{p(x|y)\times P(y|\theta)}{q(x)}]\\
& = & \mbox{E}_{q}[\log p(y|\theta)]-\mbox{E}_{q}[\log\frac{q(x)}{P(x|y,\theta))}]\\
& = & \mbox{E}_{q}[\log p(y|\theta)]-\mbox{D}_{\mbox{KL}}(q(x)||P(x|y,\theta)).

In the above equation we have simply shown that for any distribution $q(y|x)$, $L(q,\theta)$ is a lower bound on the log-likelihood using the Jensen’s Inequality(explained later on).Since the first term in the above equation does not depend upon $q(x)$, we just need to minimize the second term which is the KL divergence. Gibb’s distribution states that KL divergence is non-negative and is zero only when both of the distributions are identical which gives $q(x) = P(x|y,\theta)$.


The M-step then consists of maximizing this lower bound with respect to the parameters. Returning to our lower bound

& & \log(P(y|\theta)))\\
& \ge & \mbox{E}_{q}[\log(P(y,x|\theta)/q(x)]\\
& = & \mbox{E}_{q}[\log(P(y;x|\theta)]-\mbox{E}_{q}[\log(q(x))]\\
& = & \mbox{E}_{q}[\log(P(y;x|\theta)]+H(q(x)).

Here the second entropy term does not depend on $\theta$, so we simply set $\theta =\mbox{argmax} \mbox{E}_q[log(P(x,y|\theta)]$. That is, we wish to marginalize the expectation of the complete-data log-likelihood with respect to our auxiliary distribution $q(y)$.

Jensen’s inequality:

The intuitive explanation of Jensen’s inequality would be that it permits a comparison between the expectation of a transformation g of a random variable $X$ and this transformation g of the expectation of this random variable $X$.

In the case of g being linear, the two notions are the same — this is the linearity of expectation, where both values are (in the diagram) the y-coordinate of $M$.

In the case of g convex, we compare the image of $X$ under g with its image under the linear approximation of g at the expectation — it is very intuitive that convex functions are uniformly greater than their linear approximations (which are tangent lines, or more generally tangent hyper-planes). Since, as stated, the expectation of this linear function occurs at the expectation of the variable itself, it is also very intuitive that the expectation of the convex image of the variable is greater than the function at the expectation — the red curve biases the image of the variable to be on average greater than the y-coordinate of $M$.

Likewise, in the case of g concave, wherein it is smaller than any linear approximation, we expect the concave image of the random variable to be smaller than the image of its expectation under the concave function g. By the diagram, we would expect this concave curve to bias the image of the random variable below the y-coordinate of M.

In formulas: For $g$ linear, $\mbox{E}(g(X))=g(\mbox E(X))$. For $g$ convex: $\mbox E(g(X)) \ge(\mbox E(X))$. For g concave, $\mbox E(g(X)) \le g(\mbox E(X))$.

Example: Assume we play a game – roll a dice and you win dollars equal to the square of the number of dots (say $X$) that show up. Then $X$ is a random number and $f(X) = X^2$, is the payoff. What is the expected payoff? $\mbox E[f(X)] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 15.17$. What is the expected outcome of the dice? $\mbox E[X] = (1+2+\cdots +6)/6=3.5$. What is the payoff for the expected outcome? $f(\mbox E[X])=12.25$. Clearly, $\mbox E[f(X)] > f(\mbox E[X])$.

Now if we change the game such that the payoff is twice the outcome of the dice, then $f(X) = 2X$, then $\mbox E[f(X)] = (1+2+3+..6)*2/6 = 7$ and $f(\mbox E[X]) = 2*3.5 = 7$. Clearly, $\mbox E[f(X)] = f(\mbox E[X])$.

Convexity in the payoff for the first game [X^2] is responsible for the increased payoff. The linear payoff in the second game ensures that $\mbox E[f(X)] = f(\mbox E[X])$. Had the payoff been concave, then we could have noticed $\mbox E[f(X)] < f(\mbox E[X])$. The above point can be understood mathematically by performing a taylor expansion of $f$ around the mean of $X$. The second order term explains the behavior with respect to convex($f”(X)> 0$), concave ($f”(X)< 0$), and linear ($f”(X)=0$) functions.


  • http://www.david-andrzejewski.com/publications/em.pdf
  • An introduction to probabilistic graphical models (Michael I Jordan)

Convex Optimization

Legendre transform or convex conjugate for a given function f(x) has another variable y

$$f^{*}(y) = supp_{x} (y^{T}x – f(x)) $$
where supp is supremum.

Why do we care about convex conjugate?
1. The convex conjugate function $f^{*}(y)$ is a convex function even if f(x) is not convex.
2. The convex conjugate of the convex conjugate is also a convex function.

The Legendre Transformation $f^{*}$ is an encoding of a function f’s supporting hyperplane.

Consider function h(x) = <a,x> + b be set of all lines that are below the convex function shown in above picture($h(x)$ includes lines that don’t touch the convex function as well, as shown in figure above). Now $f(x) >= h(x)$ for all x. Now $f(x) = supp_{x}(h(x))$ Why? Lets consider a point x = a(show in figure) supremum value of h(x) at x=a would be the point touching the function(we choose the nearest point between all the blue points). So if we get $supp_x(h(x))$ for all x we get back our convex function f(x).

How do we prove that a function is a convex function?

In the first figure if we take any two points and join them with a line (shown as blue lines) all the points on the line are above the function so its a convex function.
But in case of second figure the function is not convex as the points on the pink line are above and below the function.

The convex/Legendre conjugate function of entropy is negative log partition function.
Given E[X] = r, X={1, 2, 3, 4…}
maximize $\sum_{i=1}^{\infty} P_{i}logP_{i}$

Method of Lagrange multipliers:

$L = \sum_{i=1}^{\infty} x_{i}logx_{i} + \lambda(\sum_{i=1}^{\infty}ix_{i}) + \mu(\sum_{i=1}^{\infty}x_{i} – 1)$

$\frac{dL}{dx_{i}} = logx_{i} + 1 + \lambda i + \mu$

setting $\frac{dL}{dx_{i}} = 0$

$x_{i} = e^{-1 -\lambda i -\mu}$

lets assume $y = -\lambda i -\mu$

then $x_{i} = e^{y -1}$ where y is a function of $\lambda$ and $\mu$

Lets use the convex conjugate:

$f(x) = \sum_{i=1}^{\infty} x_{i}logx_{i}$

Writing the convex conjugate:

$f^{*}(y) = supp_{x}(yx – f(x))$

$f^{*}(y) = supp_{x}(yx – \sum_{i=1}^{\infty} x_{i}logx_{i})$

$\frac{d}{dx_{i}} yx – \sum_{i=1}^{\infty} x_{i}logx_{i} = 0$

$y – (1+logx_{i}) = 0$

$x_{i} = e^{y – 1}$

$f^{*}(y) = \sum_{i=1}^{\infty} e^{y_{i}-1}$

What if P is not a exponential function? If P is not a exponential function, the solution is going to be pretty hard but we can use approximations. What we do is, we first take the convex conjugate of P and after we take the convex conjugate of P we take the double convex conjugate of P, which will involve some random variable.

Lets talk about generalized divergence. Lets consider a convex function f, Bregman divergence is basically the distance between function f and its linear approximation as shown above.

Lets consider a convex function $\phi$ and define divergence of 2 points x and y. x and y could be real numbers or functions. From the above figure we can see that the divergence is given by
$$d_{\phi}(y,x) = \phi (y) – \phi (x) – \triangledown \phi (x)^{T}(y – x)$$
what is the distance, how good is it that we want to approximate Y with this X? That’s the meaning of the Bregman divergence.

Interestingly if $\phi$ is negative entropy then the Bregman divergence is KL Divergence.

$\phi (x) = \sum_{i=1}^{n}x log(x)$

$d(y,x) = \sum_{i=1}^{n}ylog(y) – \sum_{i=1}^{n}xlog(x) – \sum_{i=1}^{n}(1+log(y))(y-x)$

$ = \sum_{i=1}^{n}ylog(y) -\sum_{i=1}^{n}xlog(x) + \sum_{i=1}^{n}x – \sum_{i=1}^{n}y + \sum_{i=1}^{n}xlog(x) -\sum_{i=1}^{n}ylog(x)$

$ = \sum_{i=1}^{n}y log\frac{y}{x} + \sum_{i=1}^{n}x – \sum_{i=1}^{n}y$

if x and y are pdfs $\sum_{i=1}^{n}x = 1$ and $\sum_{i=1}^{n}y = 1$

$ = \sum_{i=1}^{n}y log\frac{y}{x}$ (KL Divergence!)

Properties of Bregman divergence:

1. $d_\phi(y,x)\ge 0$

2. convex in y

3. It is a linear in $\phi$, which is pretty obvious from the definition.

4. Linear separation $\{x| d_\phi(x,u)=d_\phi(x,v)\}$ is a hyperplane ie We take two points U and V (U and V could be one dimension, distribution or something more complicated). All of these points x, such that xu and xv, have the same divergence defined by $\phi$ as a hyper plane or as a hyper space. The dimensionality of this plane is dimensionality of U and dimensionality of V. It could be infinite dimension, if we think of U and V as functions.

5. $\mbox{argmin}_u \mathbf E_X d_\phi (X, u) = E_X X$. The expected value of the divergence between X and some fixed number U, over distribution of X is X. the best way to take U is the expected value of X.

6. Conjugate Duality: Let $\psi(\theta)=\phi^*(\theta)$ be conjugate of $\phi(x)$ then $d_\phi(x_1,x_2)=d\psi(\theta_1,\theta2)$. The Bregman divergence defined by $\phi$ equals the Bregman divergence defined by \psi which is the conjugate function.

Exponential Family, divergence and convex optimization [09/24/2015]

Leave a Reply