In this lecture, we’ll talk about several interesting probability distributions. I assume that perhaps you have already heard of those distributions before, but what I’m going to tell you is how those distributions come into being. After you know how those distributions come into being, you will not feel those distributions as mysterious. Another message that I want to tell you is that what I find is that engineering students, including me many, many years ago, normally just pick a distribution and use this distribution for random, without thinking about what is the connotation of such probability distribution. I hope that after you know how those probability distributions that I talk about in this lecture come into being, you will think about how the other distributions come into being. Then it will help you reason about the distributions that you will work with in the future.

### Distributions from repeated Bernoulli trials

Let us tossing a coin repeatedly. Whenever we toss a coin, we conduct a **Bernoulli trial**, so we have, as a result, we could have a head up or we could have a tail up. We assign probability $p$ to the event that we get a head up. We assign probability $1-p$ to the events that we get a tail up. We call a head up as a success. We call a tail up as a fail. If we toss this coin for many, many times then we get interesting random variables and distributions. Formally speaking, let us use random variable $X$ to denote the result of a Bernoulli trial. $X$ can take two values: $X$ takes 1 with probability $p$ ($P(X=1)=p$) and $X$ takes 0 with probability $1-p$ ($P(X=0)=1-0)$.

What is the mean value and variance of this Bernoulli trial?

The expected value of $X$ is $\mbox E(X)=1\times p+0\times(1-p)=p$ according to the definition of expected value. This agrees with out intuition: If we bet on the result of a coin toss — if we win \$1 for a head and and nothing for a tail and if the probability of getting a head is $p$, we expect to get \$ $n\cdot p$ after $n$ trials, and \$ $p$ after 1 trial.

The variance of $X$ according to definition is the expected value of the square deviation of the outcomes to their mean value, $$\mbox{Var}(X)=\mbox E\left(X-\mbox E(X)\right)^2=(1-p)^2\cdot p+(0-0)^2\cdot(1-0)=p(1-p)$$.

Interesting random variables and distributions will arise from repeated coin tosses. If we throw this coin for $n$ times, then the number of heads $X$ after $n$ tosses has a **binomial distribution**. The number of fails until the first success and including the first success has **geometric distribution**. The number of successes before a specific number of fails has a **negative binomial distribution**. We will explain why it is called negative binomial distribution. The time until the first success, if the rate of success is $\lambda$, and the probability of success in any infinitesimal time $\Delta t$ is $\lambda\times\Delta t$, has **exponential distribution**. Geometric distribution is a discretized version of exponential distribution. Exponential distribution is the continuous version of geometric distribution. The time to the $k^\mbox{th}$ success has a **Gamma distribution**. Negative binomial distribution is the discrete version of gamma distributions, and gamma distribution is the continuous version of negative binomial distribution. The time to the $k^\mbox{th}$ success out of the $n$ successes happening in a fixed time has **Beta distribution**. If we throw a dice instead of a coin, the numbers of times we get faces 1-6 out of $n$ throws has **multi-nominal distribution**. As we conduct more and more experiments, the average will have first moment, and second moment, with higher moments all vanish. It has **normal distribution**.

### Binomial Distribution

Let us find the probability of getting $k$ heads from $n$ tosses, which is the probability for $X=\sum_{i=1}^n X_i$ to be $k$, where $X_1, X_2, \cdots X_n$ are the outcomes from a single toss. $X$ has binomial distribution with rate $p$ and number of trials $n$.

What is the probability that we get 2 heads out of 5 toss? Let us use $1_1$ and $0_1$ to denote a success and a failure respectively from the first toss — $X_1=1$ if we get $1_1$ and $X_1=0$ if we get $0_1$. A sequence $1_1, 1_2, 0_3, 0_4, 0_5$ of tosses has 2 successes, happing with probability $p\cdot p\cdot (1-p)\cdot (1-p)\cdot (1-p)$, where $p$ is the probability of getting $1$ and $1-p$ is the probability of getting $0$. Any permutation of the sequence $1_1, 1_2, 0_3, 0_4, 0_5$ has 2 successes, and there are $5!$ of them. The $5!$ permutations fall into ${5\choose 2}={5!\over 2!\cdot(5-2)!}$ bins, with each bin containing the $2!(5-2)!$ sequences that are the same with the subscript is dropped. For example sequence $1_1, 1_2, 0_3, 0_4, 0_5$ and $1_2, 1_1, 0_3, 0_4, 0_5$ are in the same bin and they are the same sequence after their subscripts are dropped. Hence the probability of getting 2 successes out of 5 tosses is ${5\choose 2}p^2(1-p)^{5-2}$. In general the probability of getting $k$ successes out of $n$ tosses is ${n\choose k}p^k(1-p)^{n-k}$, where $0\le k\le n$. Thus a binomial random variables is from counting the number of successes in repeated Bernoulli trials.

What is the expected value of a binomial random variable $X$?

$$\begin{eqnarray*}

\mbox{E}X &=& \mbox{E}\sum_{i=1}^n X_i = \sum_{i=1}^n \mbox{E} X_i = n\cdot p.

\end{eqnarray*}$$

What is the variance of the binomial random variable? There is a easy way, and there is a hard way. The harder way is through definition.

$$\begin{eqnarray*}

\mbox{Var}(\sum_{i=1}^{n}X_{i}) & = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}-\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\

& = & \mbox{E}\left(\left(\sum_{i=1}^{n}X_{i}\right)^{2}-2\left(\sum_{i=1}^{n}X_{i}\right)\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)+\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\right)\\

& = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)^{2}-2\mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)+\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\

& = & \mbox{E}\left(\sum_{i=1}^{n}X_{i}\right)^{2}-\left(\mbox{E}\sum_{i=1}^{n}X_{i}\right)^{2}\\

& = & \sum_{i=1}^{n}\mbox{E}X_{i}^{2}+\sum_{i,j=1}^{n}\mbox{E}\left(X_{i}X_{j}\right)-\left(n\cdot p\right)^{2}\\

& = & n\cdot\left(p(1-p)+p^{2}\right)+n^{2}\cdot p^{2}-\left(n\cdot p\right)^{2}\\

& = & n\cdot p(1-p)

\end{eqnarray*}$$

The easier way is to note that the Bernoulli trials $X_1,\cdots,X_n$ are i.i.d. or independent, and identically distributed and that the variance of the sum of independent random variables is the sum of the variances of those variables: $$\mbox{Var}(\sum_i^n X_i)=\sum_i^n\mbox{Var}(X_i)=n\cdot p\cdot(1-p)$$.

Let $X_1$ be a binomial random variable that counts the number of successes in $n_1$ independent Bernoulli trials, with the probability of success in each trials being $p$. Let$X_1$ be a binomial random variable that counts the number of successes in $n_2$ independent Bernoulli trials, with the probability of success in each trials also being $p$. It follows that $X_1+X_2$ is a binomial distribution that counts the number of successes in $n_1+n_2$ independent Bernoulli trials, with the probability of success in each trials being $p$.

### Multinomial Distribution

If we throw a dice instead of a coin, the outcome is a **categorical distribution** over the face numbers $\{1,2,3,4,5,6\}$. The probabilities for us to get each of the six numbers sum up to 1. In general, we can define a categorical distribution over all $K$ possible outcomes.

We can throw a dice repeatedly, and the numbers of occurrences of all 6 numbers is a **multinomial distribution**. What is the probability that out of $n$ independent throws, we get $1$ for $n_1$ times, get $2$ for $n_2$ times, get $3$ for $n_3$ times, and so on, supposing that in a single throw we get 1 with probability $p_1$, get $2$ with probability $p_2$, get $3$ with probability $p_3$ and so on? The probability of getting a single sequence is $$P(x_1, x_2,\cdots,x_n)=p_{x_1}\cdot p_{x_2}\cdots p_{x_n}=p_1^{\sum_{i=1}^n \delta(x_i=1)}\cdots p_K^{\sum_{i=1}^n \delta(x_i=K)}=p_1^{n_1}\cdots p_K^{n_K}$$. The probability of any of the $n!$ permutations of this sequence will be the same. The permutations of $n_k$ tosses of outcome $k$ for $k=1,\cdots,K$ will give us the same sequence. Hence the probability of getting $n_1$ $1$, $n_2$ $2$, $\cdots$, $n_K$ $K$ is $P(n_1,\cdots,n_K)={n! \over n_1!\cdots n_K!} p_1^{n_1}\cdots p_K^{n_K}$.

If we throw the dice twice, the outcome of the first throw is independent of the outcome of the second throw. But if we throw this dice for just once, the outcomes are correlated — if we get a 1, we cannot get a 2 out of a single throw.

### Geometric Distribution

The number of failed trials until we get the first success has geometric distribution. What is the probability of getting $k-1$ fails until the first success? The **probability mass function** is $P(X=k)=(1-p)^{k-1}p$ — We first get $k-1$ fails with probability $1-p$ each time, then we get a success with probability $p$. What is the probability for $X$ to be less than or equal to $k$ (for $X$ to be $1,2,\cdots,k$)? The **cumulative probability function** is $P(X\le k) = 1-P(X>k) = 1-(1-p)^k$ — we get failed trials for the first $k$ times. The mean value and variance of a geometric random variable can be found through elementary math.

A random variable with geometric distribution has **memory-less property**, $P(X>s+t)=P(X>s)P(X>t)$, the probability of getting fails in the first $s+t$ trials equals the probability of getting fails in the first $s$ trials and the probability of failing in the next $t$ trails, and the probability of failing in the next $t$ trials equals the probability of failing in the first $t$ trials (by the time homogeneous of our coin). The geometric distribution is also the distribution of **maximum entropy** over distributions with support on non-negative numbers with mean value $\mbox{E}(X)$.

Suppose the arrival of the school bus has geometric distribution, then no matter what time we arrive at the bus station, we expect to wait the same amount of time to get the next bus, which is the average waiting time. If it is the case, we can use geometric distribution to model the arrival time of school bus. Otherwise, a geometric distribution is not a pretty good distribution for this type of model.

### Negative Binomial Distribution

A negative binomial distribution is the distribution of the number of successes before we get $r$ fails, including the $r^\mbox{th}$ fail. A geometric distribution is a special case of negative binomial distribution. It is the distribution of the time until the first success, if we just reverse success and fail.

What is the probability mass function of the negative binomial distribution? What is the probability that we have $K$ successes until we get $R$ fails? We have conducted a series of experiments, and in a series of experiences, we have $K$ successes, and we have $R$ fails, so we have a specific sequence with $K$ successes and $R$ fails, where the probability is $P$ to the power of $K$, and $1-P$ to the power of $R$, because we have $K$ successes, and $R$ fails. The number of sequences that we have for us to have K successes, and R fails, and that the last one is always a fail, so we have already worked this out for many times.${\bf P}(X=k) = {k+r-1\choose k}p^{k}\cdot \left(1-p\right)^r$

That is how we get the probability mass function for negative binomial distribution. The expected value of negative binomial distribution is $pr\over 1-p$. This is intuitively.

Why it is called a negative binomial distribution? The reason is that it is related to the binomial distribution in a foreign sense. The probability for the number of successes before R fails. This is a random variable, because out of different samples, we will get different number of successes before R fails. The probability for this random variable to be less than $S$ is the same as the probability that we have more than $R$ successes after $R+S$ trials. Our successes after $R+S$ trials is another random variable, which is binomial random variable. What we are saying here, is that the probability for this negative binomial random variable to be less than $S$, is going to probability for a binomial random variable, which is the result of $R+S$ trials, to be greater than $R$. That is the relationship between the negative binomial distribution, and the binomial distribution, and it is the reason we call this negative binomial distribution.

$P(\mbox{number of success before r failure}\le s) = P(\mbox{more than r successes after r+s trails})$

### Exponential Distribution

The next distribution is the exponential distribution. The exponential distribution is the limit of geometric distribution. Previous, in geometric distribution, we conduct one experiment in 1 unit time. Now here we see that we can conduct a fractal experiment, in a sense. We can conduct experiment in $\Delta t$ time, except that the probability of success in this $\Delta t$ is $\lambda\cdot\Delta t$, so we have a constant rate. The smaller the interval, the smaller the probability of success.

This is how we get the probability for $X$ to be greater than $t$ is going to be this one.

$P(X\ge t) = \lim_{\Delta t\to 0}(1-\Delta t\cdot\lambda)^{{t\over\Delta t}-1}\cdot (\Delta t\cdot\lambda) / \Delta t = \lambda \exp(-\lambda t)$

If we take the derivative, then we will get the probability density function for exponential distribution. This is how exponential distribution arise from repeatedly throw coins. Similarly, we can find out the expected value of the exponential distribution, and the variation of exponential distribution. Since we have already said that the exponential distribution is continuous to its counterpart of a geometric distribution, then the same memory layers property of a geometric distribution can be carried over to a random variable of exponential distribution. That is how we get the memory-less property here.

Another very interesting property about exponential distribution is, suppose we have two exponential distributions. $Y1$, and $Y2$. Then, the minimum of $Y1$ and $Y2$ is also exponential distribution. We call it memory-less property. This is a distribution that has the maximum entropy, in a sense that each time we start afresh, and the past really does not matter. Each time, we have the same rate of event happening. That is why exponential distribution occurs a lot in engineering, and in other branches of science.

$\textbf{memoryless property}$: ${\bf P}(X>s+t) = {\bf P}(X>s){\bf P}(X>t)$

Exponential distribution is the $\textbf{only}$ continuous distribution with memoryless property

$\min(Y_1,Y_2)$ has exponential distribution with rate $\lambda_1+\lambda_2$ if $Y_1$ and $Y_2$ have exponential distribution with rate $\lambda_1$ and $\lambda_2$ respectively

### Poisson Distribution

The next distribution is the poisson distribution. The poisson distribution is the number of success in unit time. We talk about the time until the first fail, the time until the first success, the time until $R$ success, and so on. Here, we’re looking at a fixed interval. Then we’re looking at the number of successes in this interval. So, what is the distribution of this one? It also arises from taking the limit from binomial distribution. We take a unit time, and we divide this unit time in equal parts. The rate of success is $\lambda$. The probability of success of each equal part, each unequal part is going to be $\lambda\over n$.

If we take N to the limit, we will get this one.

$\lim_{n\to\infty}{n\choose k}({\lambda\over n})^k(1-{\lambda\over n})^{n-k}=e^{-\lambda}{\lambda^k\over k!}$

E to the power of minus lambda, which is a scaling factor to make this to be 1. We also have this one. The reason is that, this one, if we take N to the limit, it is some kind of exponential term. The reason is that this one … 1 plus X to the power of 1 over X, as X, 10 to zero, is going to be E. We can change the size to other things. We get little variation from this end result of E. That is how we get this one. I think the main tool for us to get this distribution from, by taking the limit as this one.

We can similarly find out the expected value of this Poisson distribution.

${\bf E}X = \lambda$

There’s a square here, and the variance of this Poisson distribution. A good property of Poisson distribution is that, if we have $Y1$ and $Y2$ both to be Poisson distribution with rates $\lambda1$ and $\lambda2$, then the sum of those two Poisson distributions is going to be another Poisson distribution, with the rates $\lambda1 + \lambda2$.

We also have splitting property, which is that if $X$ has a Poisson distribution with rates $\lambda$. We sample from a Poisson distribution, and we conduct Bernoulli trials over this unit time, corresponding to each success. We name these as success with probability of $P$, and we name these as a failure as probability of $1-P$. The result of the name success is still a Poisson distribution, but the rate is thinner, which is $P\cdot \lambda$. This is a thinning property of Poisson distribution.

### Gamma Distribution

The next distribution is gamma distribution, which arises often as a distribution of exponential family. The gamma distribution is the time until the $k^{\rm th}$ success. We talk about the negative binomial distribution, right? The gamma distribution is continuous counterpart of negative binomial distribution. Because gamma distribution is a continuous random variable, so we have a probability density function. The way to find out the probability density function of a gamma distribution is by considering the two distributions.

First, let’s specify that $X$ is the time to the $k^{\rm th}$ event in a Poisson process, meaning that if we sample the time to the next event of success, then we sample the time to the next success, and so on, so forth. Let’s use $Y$ to count the number of events in the interval from zero to $x$. The probability for $X$ to be less than $x$ is a cumulative probability function of a gamma distribution. The time to the $k^{\rm th}$ success is less than $x$, meaning that, in the interval from zero to $x$, there will be greater than or equal to $K$ number of events. If we talk about greater or equal to $K$ number of events, we’re talking about a Poisson distribution, and we say it is a random variable of Poisson distribution to take $K+1$, and so on, and so forth.

If we write this out, we get this one.

$P(X\le x)=1- P(Y<k)=1 – \sum_{i=0}^{k-1}{(\lambda x)^k\over k!}\exp(-\lambda x)$

That has the probability for this gamma distribution to be less than $x$.Because we are talking about continuous random variable, we can take the derivative, and find out the probability density function. This is how we get the probability density function of a gamma distribution.

${d\over dx}P(X\le x)={\lambda\over (k-1)!}(\lambda x)^{k-1}\exp(-{\lambda x})$

### Normal Distribution

All of those distributions, if you take to the limit, in a sense, is going to be normal distribution. The first moment is the expected value of random variable X, and the second moment is the expected value of X squared. The higher order moment is the expected value of X cubed, and so on, and so forth. What we know is, if we take the random variable to some sequence of random variable to some limit, then the higher order moments will all disappear, and we’ll end up with only having first moment, and the second moment, and that is how we get the normal distribution.

$f(x) = {1\over \sqrt{2\pi}\sigma}\exp(-{1\over 2}({x-\mu\over\sigma})^2)$

– ${\bf E}X =\mu$

– $\sigma^2 X = \sigma^2$

Central Limit Theorem

– Let $X_1,X_2,\dots$ be i.i.d. with $\bf EX_1=0$ and $\bf EX_1^2=\sigma^2<\infty$

– Then $\frac{S_n}{\sqrt n}=\frac{\sum_{i\le n}X_i}{\sqrt n}\to\cal N(0,\sigma^2)$

Proof: moments higher than 2 converges to 0

– $\phi(t)=\bf E e^{itX}=\phi(0)+\phi'(0)t+\frac{1}{2}\phi”(0)t^2+o(t^2)\to 1-\frac{\sigma^2 t^2}{2}+o(t^2)$ as $t\to 0$

– $\bf Ee^{\frac{itS_n}{\sqrt t}}=(\phi(\frac{t}{\sqrt n}))^n\to e^{-\frac{1}{2}\sigma^2t^2}$ as $n\to\infty$