In this lecture we will review some concepts from probability theory. Since we are using probability theory, and in particular we’re using the concept of stochastic process to understand our complex social systems. I think it is better that we can reach some common ground about the concepts of probability theory. Today we’ll begin to involve some math so please bare with me about that. The most important thing is that we use math to do different things, to do important things. My understanding is that there are different people doing different things. There are mathematicians, and there are physicists, and there are engineers.

All people do math but they do math in different ways. If we think about how mathematicians do math, they treat math as a string of symbols. What they do is, they set up some actions and they set up some rules of derivation then they foil the rules then they get pages after pages of symbols and that’s it. That’s how mathematicians do math. They don’t care about the interpretation because once they interpret this then it’s metamath, it’s not longer math. You may ask why they want to do that, just play with the symbols just like artists play with their brush. The reason is that by doing this, they will no longer be constrained by our real world so they’re in a virtual world. Although it’s not very easy, they gain new freedom in their thought experiment in doing new things because in the real world, they are constrained by forces and by physical laws. But in a virtual world defined by the actions and by the laws, they no longer have such constraints so they gain new freedom although it is not easy.

There are also physicists. Physicists also work with math but physicists work with math in a different way. A physicist is always very interested about the intuition about the math laws and the theorems. Whenever they have something, they ask, “What is the intuition and can we visualize that?” What is the force and what is acceleration? The emphasis of physicists is they want to find the intuition, they want to use math to assist them in their understanding of physical laws. This is their way of working with math.

Engineers work still in a very different way. Engineers tend to solve problems with math. This is what we do. One thing that I can say is that whenever we work with math, it’s always that we just build things. I think the most important thing is that we understand how we build those conclusions with math and I will show you the way of building the conclusions. First, it’s nothing that is not understandable. I think that we’re more interested in using math to solve our problems.

### Sample Space, Events and Field of Events

We have heard a lot about probability as random events. Let’s first have some common ground on what is sample space and event, what is a field of events, what is expectation and conditional expectation.

Let’s throw a dice twice. What is the sample space? A **sample space** is the space of all outcomes ($\Omega$). From the sample space, we might outcome $(1,1)$, meaning we get a one from the first throw and get another one from the first throw, and so on and so forth. The size of the sample space is $6\times 6=36$.

If we continue to throw a dice then we get more and more information about the sequence.

- Before we throw the dice, the only
**events**that we can talk about is the set of all possible outcomes or its complement the empty set ($(\emptyset, \Omega)$). - After we throw the dice once, we get some information — the information about the first throw: We could get 1, 2, 3, 4, 5 or 6, which could be an even or odd number, or a prime or composite number. Those are all examples of events about the first throw. How large will it be the set of events from the first throw? — It is $2^6$: The event could include 1 or exclude 1, and it could include 2 or exclude 2, and so on. Different inclusions and exclusions give different events. If we consider whether or not this event include 1 then we have two choices. Then if we consider whether the event include 2, we have another two choices. In total, we will have $2^6$ different choices. Those are all different events. The set of events after the first throw includes the set before the first throw. (Verify it!)
- When we throw this dice a second time, we can talk about the outcome of not only the first throw but also the second throw. The set of events after the first throw is a subset of the events of the second throw, which is a subset of the events in the third throw, and so on. (Verify it!) This is how, as we conduct more and more experiments, we’ll gain more and more information about this system and we will have larger and larger sets of events.

The sample space is $\Omega$. An event is any subset of $\Omega$. An **atomic event** is a event that cannot be subdivided, a **compound event** is a union of atomic events, the **complement** of an event is the case that the event doesn’t happen. For example, after the first throw, the atomic events are all the numbers we could read from the dice: $\{1,2,3,4,5,6\}$, a even number is a compound event and its complement is an odd number. The set of all events, which is a set of sets, form a **field**. We call something a field if the following conditions are satisfied.

- The empty sets should belong to the field.
- The sample space $\Omega$ should belong to the field.
- If we have two events $A$ and $B$, the
**union**of the two events (either A or B), the**intersection**of the events (both A and B), and the**subtraction**(A but not B) are also events. For example, one event could be that the outcome is an odd number. Another event could be that the outcome is a prime number, then we can talk about odd prime number, odd or prime number, and a number that is odd and is not prime.

### Random Variables

Up to now, we have defined a sample space, atomic events, compound events, and field of events. We will proceed to define **random variables** or functions defined on a field of events.

In the experiment of throwing the dice twice, we can define a function from our sample space which $\{1,\cdots,6\}\times \{1,\cdots,6\}$, it’s a tuple. We can define a function on this sample space, meaning that we can assign numerical values to different outcomes. An outcome could be $(1,1)$, for example and then we set a function that is the sum of the throws $1+1=2$. Or we could define a characteristic function which equals 1 if the number yielded by the second throw is 3 times the number yielded by the first throw, and 0 otherwise. There are many different ways to define a function on a sample space of finite number of outcomes.

A random variable defines a field. Let us inspect a random variable $X: \{1,2,3,4,5,6\}\to\{0,1\}$ which is the characteristic function for an outcome to be an even number $X(\omega)=(\omega+1) \mod 2$. So, for $\omega$ taking 1,2,3,4,5,6 an $X(\omega)$ takes 0,1,0,1,0,1. What is the field defined by $X$? According to our definition of a field, we have to add in empty set $\emptyset$. We have to add in the sample space $\{1,2,3,4,5,6\}$. We have to add in event $\{X=0\}=\{\omega: X(\omega)=0\}$. The outcomes 1,3 and 5 belong to this set. We have to add in event $\{X=1\}$, which contains the outcomes 2,4, and 6. We can verify that this is the field generated by X: The empty set and the world set $\Omega$ belong to the set of sets. For any event $A$ and $B$, $A \cup B$, $A\cap B$ and $A\\ B$ also belong to a set.

In addition, there are many ways to define a field on the sample space, a random variable may be on field set or maybe not on this field. For example, we have a field of events defined by $X(\omega)$ — that $\omega$ is a odd number or a even number. Is random variable $Y(\omega)=\omega$ defined on the field generated by $X$? It is not because the atomic events ($\{1,2,3,4,5,6\}$) are not events generated by $X$. The field generated by $Y$ includes and is larger than the field generated by $X$. If we look at the field subset relationship in the other direction, we find that the events generated by X, which is $\{X=0\}$, $\{X=1\}$, the whole set $\Omega$, and the empty set $\emptyset$, all belong to the events generated by Y. In this incidence we say that random variable $X$ is $Y$ measurable or is the random variable defined on Y.

For another example, we can define a random variable as the sum of two throws. This sum defines a field generated by events $\{X_1+X_2+=2\},\cdots,\{X_1+X_2+=12\}$. (Verify this!)

It is a little complicated to define a random variable as a function defined on a field rather than to directly talk about atomic events and assign a probability measure. It is actually necessary if we want to deal with complicated randomness in reality. For example the world economics as our sample space is too complex to be modeled. A random variable X such as “whether the stock price of Apple will rise or fall tomorrow” could be much simpler and manageable. X is a map from the very complicated world into a much simpler one, and X does not capture all randomness about the world economics. To see that X doesn’t capture all randomness in the world of economics, we can define a random variable Y, which is “whether the funding in my retirement account is going to rise or fall tomorrow”, and which is likely independent of random variable $X$ and not defined on the field generated by $X$. In contrast if we confine ourselves to only work with the values of X as our sample space, we cannot describe something as complex as world economics. In addition since X captures all randomness that we know of, we cannot defined a random variable $Y$ from the sample space defined by the value of $X$ that is independent of $X$. That is the reason why we bother to first define omega which is sample space and then define a random variable X which maps this sample space into a simpler space. We can start from some very complex world and we chip something out to study which is simple and manageable.

### Probability, Expectation & Conditional Expectation

A probability measure is positive numbers assigned to the atomic events in a sample space. The sum of the probabilities assigned to the atomic events should sum up to 1 because we do not want to leave out something.

Let’s assign equal probability the 36 possible outcomes from two throws of a dice. The probability for each outcome is 1/36. Now that we have assigned probability to all of the atomic events, we can talk about the probability assigned to compound events, which is the sum of the probabilities assigned to its elements. For example, what is the probability that the sum of the two throws is 2? It is 1/36 because the only possible atomic event to generate a 2 is $(1,1)$. Similarly the probability for the sum of the two throws to be 3 is 2/36, and the probability for the sum to be 4 is 3/36.

After we have assigned probability to the sample space, we can estimate the **expectation** and **variance** of a random variable. (A random variable is a function that assigns a value to an outcome of a experiment.) What is the probability mass function of $X$, that is the sum of two throws, supposing that every one of the 36 outcomes are equally likely? The outcome of the sum of the two throws is from 2 to 12 (where 12 corresponds to $(6, 6)$), with each outcome $x$ assigned a probability that is the number of occurances of $x$ in the following table divided by 36. The sum of the probabilities sum up to 1.

$$\begin{array}{c|cccccc}

+ & 1 & 2 & 3 & 4 & 5 & 6\\

\hline 1 & 2 & 3 & 4 & 5 & 6 & 7\\

2 & 3 & 4 & 5 & 6 & 7 & 8\\

3 & 4 & 5 & 6 & 7 & 8 & 9\\

4 & 5 & 6 & 7 & 8 & 9 & 10\\

5 & 6 & 7 & 8 & 9 & 10 & 11\\

6 & 7 & 8 & 9 & 10 & 11 & 12

\end{array}$$

What is the expectation of X? The expectation of this random variable is the weighted sum over all possible outcomes of X, where the weight is the probability ($E(X)==\sum_\omega X(\omega)P(\omega)=\sum_x x\cdot p(x)$). If we sum these together, we get 7. What is the variance of X? The variance of X is the weighted square of the difference between X to the average value of X, where the weight is the probability($\mbox{Var}(X)=\sum_x (x-E(X))^2\cdot p(x)$). We can show that $\mbox{Var}(X)=E(X^2)-(E(X))^2$.

Similarly, if we define a random variable $g\circ X$ by function composition, where random variable X is already a function. For example, if X is identity function $X(\omega)=\omega$, $g\circ X(\omega)=2\times\omega$ for $g(X)=2\times X$, and$g\circ X(\omega)=\omega^2$ for $g(X)=X^2$. We can estimate the expectation of function composition $E(g\circ X) = \sum_\omega g(X(\omega))p(\omega)$, where $\omega$ takes atomic events in the sample space. There is a simpler way to estimate expectation: $g\circ X=\sum_x g(x) p(x)$, because the field generated by $X$ is a subfield of the field generated by all $\omega\in \Omega$.

Now we talk about **conditional expectation**, **conditional probability** and the **conditional independence**. Recall that we assigned probability 1/36 to each of those 36 events. From this probability assignment it follows that the probability for the first throw and the probability for the second throw are independent. Here is the reason. We know that $P((m,n))={1\over 36}$, $P((m,\bullet))={1\over 6}$ and $P((\bullet,n))={1\over 6}$ by our setup. Hence $P((m,n))=P((m,\bullet))\times P((\bullet,n))$. This means that given the result of the first throw does not help us much to make inference about what will be the result of the second throw. That’s the first throw and the second throw as independent based on our assignment of 1/36 to the atomic events.

For another example, conditioned on that is the sum of the two throws is 3, the probability of the first throw to yield 1 is 1/2 and the probability of the first throw to yield 2 is also 1/2. The reason is that we only have two cases that we can yield a sum of 3, which is $(1,2)$ and $(2,1)$.

Below are several important properties of expectation and variance:

**Expectation of linear combination**Let $X_i$ be random variables, then $\bf E(a_0+a_1 X_1+\dots + a_n X_n) = a_0 + a_1\bf E(X_1) + \dots + a_n \bf E(X_n)$. (Why?)**Expectation of independent product**Let $X$ and $Y$ be independent random variables, then $\bf E(X Y) = \bf E(x)\bf E(Y)$. (Why?)**Law of total expectation**Let $X$ be integrable random variable (i.e., $\bf E(|X|)<\infty$) and $Y$ be random variable, then $\bf E(X) = \bf E_Y(\bf E_{X|Y}(X|Y))$. (Why?)**Variance of linear combination**Let $X_i$ be random variables, then $\bf{Var}(a_0+a_1 X_1+\dots + a_n X_n) = a_0^2 + a_1^2\bf{Var}(X_1) + \dots + a_n^2 \bf{Var}(X_n)$- $\bf{Var}(X) = \bf E(X^2) – {(\bf E(x))}^2$

A stochastic process evolves and we get more observations about it over time. The future observations are conditioned on the current and the past observations. That is how this conditional probability and conditional expectation emerge and why it is important in studying a stochastic process. Let us work with conditional expectation in a finite sample space $\Omega$. The field of events is the power set $2^\Omega$ or the set of all subsets of $\Omega$. We are given a field $\mathcal G$ generated by a partition $(D_1,\cdots,D_k)$ ($\cup_i D_i=\Omega$, $D_i\cap D_j=\emptyset$ for $i\ne j$). We are given a random variable X which assigns the atomic events to the atoms to a number $x_1,\cdots, x_p$. The **conditional probability of a event A given event D** is the probability of A and D divided by the probability of D, or the probability for A and D to both happen conditioned on that probability D happened. For example the conditional probability for the sum of two throws to be even ($X_1+X_2 \mod 2=0$) conditioned on that the first throw is a prime number ($X_1$ is prime). The **conditional probability of event A given a field $\mathcal G$ **is the probability of an outcome $\omega$ to be in $A$ conditioned on that we know which $D_i$ this $\omega$ belongs to: $P(A|\mathcal G)(\omega)=\sum_i P(A|D_i)I_{D_i}(\omega), where I is an indicator function and $d_1$ to $d_k$ make a partition of the whole sample space. So conditional probability is a random variable. For example the conditional probability for the sum of two throws to be a even number conditioned on that we know whether the first throw is a prime number or a composite number — The conditioned random variable (or field) represents things we know and the conditioning variable represents things we do not know. Similarly we can also define **conditional expectation** of a random variable given a field $E(X|\mathcal G)=\sum_i x_i P(X=p_i|\mathcal G)$, and a conditional expectation is also a random variable.

**Continuous Probability Space**

We talked about how people work on a space with finite number of values. Now we move from finite to infinite. The way that we do this is actually very simple: We approximate the infinite in terms of finite, and take the limit. That is, we aim to approximate the probability of a measurable set in a continuous space better and better when the number of “tiles” to cover this measurable set goes to infinite. Similarly, we approximate a measurable functions with a finite sum of indicator functions of measurable sets and aim to improve our approximation of the continuous function as we sum over more and more indicator functions. A random variable is just a measurable function that maps a probability space into a measurable space.

When we move from a sample space with finite outcomes to one with continuous outcomes, we need to work with a **sigma field**. A sigma field is a field with the capability to move us from finite to infinite.In a $\sigma$-field, disjoints measurable sets are countably additive, meaning that we can add countably number of sets.

Let us sample a standard normal distribution using $X=\sum_{i=1}^{12}(U_i-{1\over 2})$, where $U_i$ takes uniform distribution on $[0,1]$. We can show that $X$ has mean 0, standard deviation 1, and probability density function very close to standard normal distribution.

Our sample space here is 12 dimensional $\prod_{i=1}^{12}[0,1]$. $X$ maps the 12 random variables of uniform distributions to the real line. We can define **probability density function** and **cumulative probability function** for real-value random variable $X$.

### Bayesian Theorem

**Bayesian theorem** enables us to turn the conditional around. We have worked with Bayesian theorem in the last lecture about predicting whether from the daily activity of my girlfriend. Here we have another one. Suppose we have a very rare disease and this disease is pretty serious so we want to screen this disease in advance. Let $p$ represents positive clinical test and $d$ represent having disease. Let the probability of positive clinical test conditioned on disease be $P(p|d)=.98$ and let the probability of positive clinical test conditioned on no disease be $P(p|\Omega\\d)=.02$, and let the prior probability of disease be $P(d)=.0001$. If I’m diagnosed to have some positive clinical test result, what is the probability that I have a disease? We will use Bayesian theorem to turn the conditional around, from $P(p|d)$ to $P(d|p)$. It turns out that if I have positive clinic test, the probability for me to have this disease is still pretty small, .1%, which is still not much to worry about. On the other hand, since the prior probability is .001, a positive clinic test increase the probability of having disease by 10-fold.

What is the relationship between mutual independence and pairwise independence? Mutual independence implies pairwise independence but not the other way. In the following example, we have 2 out of 4 faces in red, green and blue, hence $P(r)=P(g)=P(b)={1\over 2}$. We have 1 out of 4 faces in red and green, green and blue, and red and blue, hence $P(rg)=P(gb)=P(br)={1\over 4}$ so we have pairwise independence. We have 1 out of 4 faces in red, green and blue, hence $P(rgb)={1\over 4}$ But since $P(rgb)={1\over 4}\ne P(r)P(g)P(b)={1\over 8}$, we do not have mutual independence.