Statistical Physics [11/17/2015]

Thermodynamics of complex systems

The topic of this lecture is an introduction to Statistical Physics. In Statistical Physics, one thing that people do is to explain macroscopic phenomena through the microscopic interactions of agents such as atoms. We will look at how this was accomplished and how we expect to accomplish similar things in our study of social systems. What people normally do in Physics is use the smallest number of laws to explain the largest number of phenomena that we encounter.

The beauty of this is that if we only have a small number of laws, then our possibility of making errors is minimized. If we know that those laws have been tested by researchers of many generations without being refuted, then there is something about those laws that we can rely on. In terms of thermodynamics, people previously did not know what was happening. It turned out to be something statistical.

Prior to the 20th century, the prevailing idea was that, given sufficient tools to observe everything, we would be able to explain everything through Newton's three laws. After the beginning of the 20th century, people began to think in terms of information and statistics. The prevailing idea, including the idea of today, is that we are not able to observe everything about our system, so we have to talk about probability: what is the probability for us to observe certain things?

In terms of thermodynamics, what people have observed includes the following. Early on, people did not know what heat is, so some people said that heat is perhaps some kind of substance, like atoms, that moves from one place to another. This turned out to be not true because we can construct many counterexamples. For example, when ice melts into water, it seems like everything is the same, so where are those atoms, or where did those atoms … from, right? This is one example.

Another example is that people found that they could generate an infinite amount of heat, so there is no conservation law for the amount of heat. There was this person named Joule. Joule constructed a way to turn mechanical energy into heat, and he also constructed a way to turn electrical energy into heat. This means that those things can be interchanged. Another finding that was pretty important is the Carnot cycle with its four stages.

We can use those four stages either to turn some heat into mechanical energy or to turn mechanical energy back into heat. Based on all of those things, what people ultimately found looks like the following. We have the internal energy of the system, and this energy is a function of volume, of course, because if we compress or expand the system, we change its configuration and its physical composition. There is another observation in thermodynamics, which is that a system always goes … Heat always goes from high temperature to low temperature.

If we put two systems together, heat goes from the subsystem with higher temperature to the subsystem with lower temperature, so there is a flow kind of thing in terms of heat. Based on all of those things, people ultimately write things in the following fashion. Here we are using a first-order approximation: the change of internal energy is the temperature times the change of entropy minus the pressure times the change of volume. The frame of mind here is, first, that this is a first-order approximation and, second, that we have two different types of variables.

One type of variable is called extensive and the other type is called intensive. When we combine subsystems into one, the value of an extensive variable of the combined system equals the sum of the values of the subsystems, while the differences in intensive variables between subsystems drive the change in the corresponding extensive variables. Extensive variables include things such as weight: if we put two systems together, then the weight of the combined system is the sum of the weights of the two subsystems. Pressure, for example, is an intensive variable, because if we put two subsystems of different pressures together, then the difference in pressure drives the movement of molecules from the subsystem of higher pressure to the subsystem of lower pressure. In symbols, the first-order change of internal energy is:

$d U(S, V) = \left({\partial U\over\partial S}\right)_V d S + \left({\partial U\over\partial V}\right)_S d V$,

where temperature $T=\left({\partial U\over\partial S}\right)_V$ and pressure $p=-\left({\partial U\over\partial V}\right)_S$ are intensive variables, and entropy $S$ and volume $V$ are extensive variables.

There are several things that are very frequently mentioned or very frequently used.

First, the **free energy** $F(T,V) = U - T S$ is the Legendre pair of the internal energy $U$, and $T$ and $S$ are the Legendre conjugate variables.

We know that it is pretty hard to turn the energy associated with entropy into mechanical energy that we can use. The part of the energy associated with entropy is something that we cannot use, but the rest of the energy is something that we can use. We call that part "free energy." If we look at the form of the free energy, we find that it contains the multiplication of an intensive variable and an extensive variable. So we have two things: one is the free energy and one is the internal energy. If we recall our lecture on convex optimization, convex duality and the variational method, we find that the free energy is the Legendre-Fenchel pair of the internal energy.

Second, the **enthalpy** $H(S,p) = U(S,V) + p V$ is the Legendre pair of the internal energy $U$, and $p$ and $V$ are the Legendre conjugate variables. In other words, enthalpy is the Legendre-Fenchel pair of the internal energy with conjugate variables $p$ and $V$.

Similarly, the **Gibbs free energy** $G(T,p) = U - T S + p V$ is the Legendre pair of the internal energy $U$, with $(T,p)$ and $(S,V)$ being the Legendre conjugate variables. We also call a Legendre pair a convex pair, so the Gibbs free energy is the convex pair of the internal energy with two-dimensional conjugate variables. On the one hand we have $T$ and $p$, temperature and pressure, which are both intensive variables; on the other hand we have $S$ and $V$, entropy and volume, which are both extensive variables.

It is not very surprising that convex optimization shows up here alongside our variational method, because the idea of convex optimization ultimately came from here. Similarly, the idea of the graphical models that we talked about previously in this class also comes from Statistical Physics. Here I just want to review the interpretation of the Legendre transform. We are given a function $f$, and $f$ is a convex function.

We define the Legendre transform as follows:

$f: I\to \mathbf R$ is a convex function.

$f^*(x^*)=\sup_{x\in I} (x^*\cdot x-f(x))$ is its Legendre transform.

So a Legendre transform takes a different variable. We define the Legendre transform at the point $x^*$ as follows: we multiply the variable $x$ with its conjugate $x^*$, subtract the function $f(x)$ from the product, and take the supremum over $x$. That gives the Legendre transform at the conjugate variable. In thermodynamics the product is that of an intensive variable with an extensive variable. Now think of it in the following form: on the $x$ axis we have the variable $v$ and on the $y$ axis we have the variable $p$.

We can express $v$ in terms of $p$, and we can express $p$ in terms of $v$, and this curve is a monotonically increasing function. The area underneath the curve we denote by $L$, and the area to the left-hand side of the curve we denote by $H$. So $L$ is the integral of $p$ from zero to $v$, and $H$ is the integral of $v$ from zero to $p$. What we have is $L$ of $v$ plus $H$ of $p$ equals $p$ times $v$. If we think of it in another way, $H$ of $p$ equals $p$ times $v$ minus $L$ of $v$.





$L(v)=\int_0^v p(v')\,dv'$ $\Rightarrow {\partial L\over\partial v}=p$

$H(p)=\int_0^p v(p')\,dp'$ $\Rightarrow {\partial H\over\partial p}=v$

$H(p) = p\cdot v-L(v)$

$L(v)+H(p) = p\cdot v$

This is another interpretation of the Legendre transform. What we want to notice is that we have two functions, $L$ and $H$, which are Legendre pairs, and they take different variables. So the Legendre transform of a function takes a different set of variables, not necessarily the variables of the original function. Actually, the two functions have names: $H$ is what people normally call the energy, or the Hamiltonian, and $L$ is the Lagrangian.



A function $y=f(x)$ corresponds to

the envelope of lines $y=p\cdot x-f^*(p)$, where

$p=f'(x)$ is the slope of $f(x)$ at $x$

$-f^*(p)$ is the intercept determined by $p$

$f^*(p)$ is convex in $p$

$f(x)=f^{**}(x)$ if $f(x)$ is convex
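The properties listed above can be checked numerically. The following is a minimal sketch, not code from the lecture: it assumes the illustrative choice $f(x)=x^2/2$, whose Legendre transform is $f^*(p)=p^2/2$, and replaces the supremum with a maximum over a finite grid. The function name `legendre` and the grid limits are assumptions made for this example.

```python
# Numerical Legendre transform on a grid (illustrative sketch).
def legendre(f_vals, xs, ps):
    """Return f*(p) = max over grid points x of (p*x - f(x)), for each p in ps."""
    return [max(p * x - fx for x, fx in zip(xs, f_vals)) for p in ps]

# Grid on [-3, 3] with step 0.01 (assumed wide enough for the checks below)
xs = [i / 100.0 for i in range(-300, 301)]
f_vals = [0.5 * x * x for x in xs]           # f(x) = x^2 / 2, a convex function

ps = [i / 100.0 for i in range(-300, 301)]   # slopes at which f* is evaluated
f_star = legendre(f_vals, xs, ps)            # should match f*(p) = p^2 / 2

# Transform again: for convex f we expect f** = f (checked on [-1, 1])
check_xs = [i / 10.0 for i in range(-10, 11)]
f_star_star = legendre(f_star, ps, check_xs)

max_err_star = max(abs(fs - 0.5 * p * p) for p, fs in zip(ps, f_star))
max_err_bi = max(abs(v - 0.5 * x * x) for x, v in zip(check_xs, f_star_star))
print(max_err_star, max_err_bi)
```

Both errors should be at the level of floating-point rounding, because for this particular $f$ the maximizing point lies on the grid. For other functions the grid maximum only approximates the supremum.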

Let us compare the two. People first understood thermodynamics before they thought of it in terms of the Legendre transform. The signs in those expressions look pretty weird, but people first had the equation for the increment of energy and only later found out that all of those things have the same form, which is in terms of Legendre pairs. If we take $T$ and $S$ as the Legendre conjugate variables, we get the free energy. If we take another set of Legendre conjugate variables, we get another pair of Legendre functions.

This is the main interpretation that we talked about before. We have a function $y=f(x)$ and we want to find its Legendre transform. We first take a line with slope $p$ and then move the line up until it first touches the curve of this function. Because the function is convex, the line will touch the function at some point for the first time. When the line with slope $p$ first touches the function, we read off its intercept on the $y$ axis, which is $-f^*(p)$; that gives the Legendre pair corresponding to the slope.

This is another interpretation of this idea, and those interpretations are actually the same. What we have is the Legendre transform at $p$ plus the original function at $x$, where $p$ and $x$ are Legendre conjugates. Added together, the total length is $x$ times the slope, that is $f^*(p) + f(x) = p\cdot x$. As we mentioned, people have found many interesting things about thermodynamics, but the question people had was: how do we explain thermodynamics using the least number of laws and the least number of assumptions?

Boltzmann relation and partition function


Here $E$ is a scalar variable, which is the energy, but $E$ could be other things. It could be any feature. What we know is that the maximum entropy probability distribution given a set of features, the sufficient statistics $E$, takes an exponential form. We have a set of natural parameters $\beta$ corresponding to those features $E$: given $E$ we can find $\beta$, and given $\beta$ we can find $E$. Boltzmann considered the probability for a system to have energy $E_j$, where $j$ is an index of the energy. If we think of a system with a finite number of molecules, then the number of energy configurations is finite and we can actually number them.

According to Boltzmann it looks like the following form.

$P_j = \exp(-\beta E_j) / Z$, where $P_j$ is the probability of observing the $j^\mbox{th}$ state with energy $E_j$, and

$Z=\sum_j \exp(-\beta E_j)$ is the **partition function**.

$Z$ here is called the partition function; it is the normalization constant that makes this a probability. The classical way to demonstrate that we have this distribution is the following. We consider a small cylinder and put it in a heat reservoir. The small cylinder is in energy configuration $j$ with energy $E_j$.

The energy of the reservoir is $E_R$, which is the total energy minus $E_j$. The total energy is $E$, and we assume that the total energy of the whole system of reservoir and small cylinder is constant, so $E_R$ plus $E_j$ equals $E$.

The following is a very critical assumption: in a system, all configurations are equally likely. Of course we have a constraint, which is that all configurations should have the same total energy $E$, right? This is very critical, and it is actually what we have previously identified as maximum entropy: all configurations are equally likely.

$P_j \propto \Omega(E-E_j)$, and $\log\Omega(E-E_j) = \log\Omega(E)-\beta E_j+\cdots \Rightarrow P_j\propto \exp(-\beta E_j)$. Here $\Omega(E-E_j)$ is the number of configurations of the reservoir with energy $E-E_j$. What we do is take the first-order approximation around the energy $E$, so the leading term is $\log\Omega(E)$. The next one is the first-order term, and $\beta$ naturally comes out of it as the partial derivative of the log number of configurations with respect to energy, $\beta = {\partial\log\Omega\over\partial E}$. From here we can say that the probability for the system to be in energy state $j$ is proportional to the exponential form above.

Link to thermodynamics


${\partial \log Z\over \partial \beta} = -\sum_j E_j P_j = -U$, where $U$ is internal energy

${\partial \log Z\over \partial V} = {1\over Z} \sum_j {\partial\over\partial V}\exp(-\beta E_j(V))= \beta \sum_j p_j P_j = \beta p$ where $p_j$ is the **pressure** associated with the $j^\mbox{th}$ state.
– $d\log Z= {\partial\log Z\over\partial\beta} d\beta + {\partial\log Z\over\partial V} d V=-U d\beta+\beta p d V = -d(\beta U) + \beta \underbrace{(d U + p d V)}_{T d S}$, or $T d S = {1\over\beta} d (\log Z+\beta U)$, or $S=k \log Z + {U\over T}$, where $k={1\over\beta T}$ is Boltzmann's constant, or $U-T S=-k T \log Z = F$ is the free energy.
– $S = k\log Z + k\beta U = k \log Z + k \sum_j \beta E_j P_j = -k \sum_j P_j \log P_j$ is the entropy relationship.
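The Boltzmann relation and its links to thermodynamics can be checked on a toy system. This is a minimal sketch under stated assumptions: the four energy levels, the value of `beta`, and the units in which $k=1$ are all hypothetical choices for illustration. It computes $P_j$ and $Z$, compares a finite-difference estimate of $\partial\log Z/\partial\beta$ with $-U$, compares the two entropy expressions, and compares $U-TS$ with $-kT\log Z$, which is what follows from $S=k\log Z+U/T$.

```python
import math

def boltzmann(energies, beta):
    """Return (probabilities, Z) for P_j = exp(-beta * E_j) / Z."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z

def log_z(energies, beta):
    """Log partition function for a finite list of energy levels."""
    return math.log(sum(math.exp(-beta * e) for e in energies))

energies = [0.0, 1.0, 2.0, 5.0]   # hypothetical energy levels
beta = 0.7                        # hypothetical inverse temperature
k = 1.0                           # units in which Boltzmann's constant is 1

probs, z = boltzmann(energies, beta)
u = sum(e * p for e, p in zip(energies, probs))   # internal energy U

# Central finite difference for d(log Z)/d(beta); should be close to -U
h = 1e-6
dlogz_dbeta = (log_z(energies, beta + h) - log_z(energies, beta - h)) / (2 * h)

# Two expressions for the entropy that should agree
s_from_z = k * math.log(z) + k * beta * u
s_gibbs = -k * sum(p * math.log(p) for p in probs)

# Free energy F = U - T S with T = 1 / (k * beta); compare with -k T log Z
t = 1.0 / (k * beta)
free_energy = u - t * s_from_z
print(dlogz_dbeta, -u, s_from_z, s_gibbs, free_energy, -k * t * math.log(z))
```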


Sample from Probability Distributions [08/10/2015]

Today our topic is how we sample from different probability distributions. It is always good to know how to sample them, especially if we work with a platform that does not have the corresponding distribution defined or does not have a function that gives us its properties. The first way to sample from a probability distribution is through the inverse transform. The inverse transform works in the following way. We have a random variable $X$. Here, $F(x)={\bf P}(X\le x)$ represents the cumulative distribution function.

What are the properties of  $F(x)$?

First, it is a function that takes values between zero and one, because $F(x)$ is the probability of the event that $X$ is less than or equal to $x$. A probability is always greater than or equal to zero and smaller than or equal to one. That is why this is always between zero and one. The second property is that it is always non-decreasing. The reason is that a probability mass function or density function is non-negative, which means that we do not deal with negative probability.

We also know that as $x$ tends to minus infinity $F(x)$ tends to zero, because no probability mass is left below. As $x$ proceeds to positive infinity $F(x)$ tends to one, because no probability mass is left above. Those are the properties of the cumulative distribution function.


What is Inverse Transform?

$F^{-1}(y) = \inf\{x|F(x)\ge y\}$ where $0\le y\le 1$ is the inverse of $F$

What we do in the method of inverse transform is first take the inverse of $F(x)$. The inverse, $F^{-1}(y)$, is defined as the infimum, that is the smallest number $x$, such that $F(x)$ is greater than or equal to $y$.

The distinction does not matter if we deal with a continuous, strictly increasing function. But if we deal with, for example, a distribution that takes support on a finite set, then we have to be careful about whether we take the minimum or the maximum. Whenever we have a point mass, the CDF has a jump there; $F(x)={\bf P}(X\le x)$ is right-continuous at the jump, and the inverse defined through the infimum is left-continuous.
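For a distribution supported on a finite set, the infimum in the definition of $F^{-1}$ can be implemented with a binary search over the cumulative probabilities. The sketch below is an illustration, not code from the lecture; the support `[1, 2, 3]`, the probabilities, and the helper name `make_discrete_sampler` are assumptions made for this example.

```python
import bisect
import random
from itertools import accumulate

def make_discrete_sampler(values, probs):
    """Build the generalized inverse CDF for a distribution on sorted values."""
    cdf = list(accumulate(probs))   # F evaluated at each support point
    cdf[-1] = 1.0                   # guard against rounding in the last sum

    def inverse_cdf(y):
        # bisect_left returns the first index i with cdf[i] >= y, which is
        # the infimum of {x : F(x) >= y} restricted to the support.
        return values[bisect.bisect_left(cdf, y)]

    return inverse_cdf

inverse_cdf = make_discrete_sampler([1, 2, 3], [0.2, 0.5, 0.3])

rng = random.Random(0)              # fixed seed so the run is repeatable
samples = [inverse_cdf(rng.random()) for _ in range(100_000)]
freq = {v: samples.count(v) / len(samples) for v in (1, 2, 3)}
print(freq)                         # frequencies should be near 0.2, 0.5, 0.3
```

Note that `inverse_cdf(0.2)` returns the support point `1`, the smallest $x$ with $F(x)\ge 0.2$, which is the behavior the infimum in the definition asks for at a jump.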

Here is the inverse transform of a uniform random variable $U$ on $(0,1)$:

$X=F^{-1}(U)$ follows the distribution with cumulative distribution function $F(x)$.


${\bf P}(F^{-1}(U)\le x) = {\bf P}(U\le F(x)) = F(x)$

We know that the cumulative distribution function takes values from zero to one, so what we do is first sample on the $y$ axis. After we sample on the $y$ axis, we find the corresponding value on the $x$ axis. We take this $x$, and this $x$ has the cumulative distribution function $F(x)$.

Here is an example.

$F(x)=1-\exp(-\lambda x)$ is the cumulative distribution function of an exponential random variable. This is how we get a sample from the exponential distribution through the method of inverse transform:

$X=F^{-1}(U) = -{1\over\lambda}\log(1-U)=-{1\over\lambda}\log U_1$, where $U_1=1-U$ is also uniformly distributed on $(0,1)$.

Here, I sample from a uniform distribution, make the transform, and then compare the quantiles of the transformed sample with the quantiles of the target distribution, and we find that the quantiles are equal. This intuitively helps us verify that this is the right way to sample from a distribution with a known cumulative distribution function, starting from a uniform distribution.
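The lecture's code for this comparison is not part of the transcript. As a stand-in, here is a minimal sketch of the same check: it draws exponential samples through $X=-\log(1-U)/\lambda$ and compares empirical quantiles with the exact quantiles $F^{-1}(q)=-\log(1-q)/\lambda$. The rate `lam = 2.0`, the sample size, and the helper names are assumptions made for illustration.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw one exponential(lam) sample via X = -log(1 - U) / lam."""
    u = rng.random()                  # uniform on [0, 1)
    return -math.log(1.0 - u) / lam   # 1 - u lies in (0, 1], so log is defined

def empirical_quantile(sorted_xs, q):
    """Simple empirical quantile: the value at rank floor(q * n)."""
    return sorted_xs[min(int(q * len(sorted_xs)), len(sorted_xs) - 1)]

lam = 2.0                             # hypothetical rate parameter
rng = random.Random(1)                # fixed seed so the run is repeatable
xs = sorted(sample_exponential(lam, rng) for _ in range(200_000))

# Compare empirical quantiles with the exact ones, -log(1 - q) / lam
for q in (0.1, 0.5, 0.9):
    exact = -math.log(1.0 - q) / lam
    print(f"q={q}: empirical={empirical_quantile(xs, q):.4f}, exact={exact:.4f}")
```

Comparing against exact quantiles is a design choice for this sketch; comparing against quantiles of a second sample drawn by a library routine, as a quantile-quantile plot does, would serve the same purpose.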


Another way to sample from a distribution is a method called acceptance-rejection. Suppose that we have a probability density function $g(x)$, and the random variable $Y$ has this probability density function. Suppose that whenever we make a sample of $Y$, we either accept this $Y$ or reject this $Y$, and the probability of accepting $Y=y$ is $h(y)$. What we want to show is that the accepted $Y$ is a random variable whose probability density function is proportional to $g(x)\times h(x)$. The reason is the following.

The cumulative distribution function of the accepted sample $X$ is ${\bf P}(X\le x)$. Without the acceptance step, this would be $\int_{-\infty}^x g(y)\,dy$. But we have different rates of acceptance corresponding to different values of $Y$, so we need to multiply the integrand by $h(y)$, which gives $\int_{-\infty}^x h(y)g(y)\,dy$, and then we need to normalize.

The normalization constant is the total probability of acceptance, $\int_{-\infty}^\infty h(y)g(y)\,dy$, so ${\bf P}(X\le x)=\int_{-\infty}^x h(y)g(y)\,dy \,/ \int_{-\infty}^\infty h(y)g(y)\,dy$. That is how we claim that the probability density function of the accepted sample is proportional to $g(x)\times h(x)$.

As an example, we first sample from a uniform distribution supported on the interval from zero to one, and then accept this sample point according to an acceptance function proportional to the density of a beta distribution. After we do this, we know that the result will follow the beta distribution. Here, if you run the code, I have plotted the true distribution $B(4,3)$ as the black line, and the distribution estimated from the acceptance-rejection method as the red line. I have also created another way of sampling a beta distribution, which is through order statistics, shown as the green line.


We can see that those three lines are pretty close, and as the sample size tends to infinity the red line and the green line should approach the black line. You can study the code afterwards.
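The lecture's code and plot are not reproduced in this transcript, so the following is only a sketch of the two approaches it describes. It assumes a uniform proposal and the acceptance probability $h(u)=u^3(1-u)^2/h_{\max}$, where $u^3(1-u)^2$ is the unnormalized $B(4,3)$ density and $h_{\max}$ is its maximum at $u=3/5$. The order-statistic sampler takes the 4th smallest of 6 uniforms. Function names and sample sizes are illustrative.

```python
import random

def beta43_accept_reject(rng):
    """Sample B(4,3): propose U ~ Uniform(0,1), accept with probability
    h(u) = u^3 (1-u)^2 / h_max, where h_max is the value at the mode u = 3/5."""
    h_max = 0.6 ** 3 * 0.4 ** 2
    while True:
        u = rng.random()
        if rng.random() < u ** 3 * (1.0 - u) ** 2 / h_max:
            return u

def beta43_order_statistic(rng):
    """Sample B(4,3) as the 4th smallest of 6 uniforms (k = 4, n = 6)."""
    return sorted(rng.random() for _ in range(6))[3]

rng = random.Random(2)           # fixed seed so the run is repeatable
n = 50_000
ar = [beta43_accept_reject(rng) for _ in range(n)]
os_ = [beta43_order_statistic(rng) for _ in range(n)]

mean_ar = sum(ar) / n            # B(4,3) has mean 4/7
mean_os = sum(os_) / n
print(mean_ar, mean_os, 4 / 7)
```

Comparing sample means is a much weaker check than the density plot described in the lecture; a histogram of `ar` and `os_` against the $B(4,3)$ density would be the closer analogue.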


There are also other methods of sampling distribution. Let’s suppose that we have cumulative distribution function which is $F$.

$F=\sum_{i=1}^k p_i F_i$ is a cumulative distribution function, where $\sum_i p_i=1$

$F$ is a mixture of $k$ distributions, $F_1$ to $F_k$. If we can decompose $F$ into a mixture of distributions, then we can sample from $F$ by first sampling the category, that is, which component we will sample from.

After we sample the category, say we know that we want to sample from the $J$th component, we then sample $X$ from $F_J$. $X$ has the cumulative distribution function $F$: if we write out the cumulative distribution function of $X$ by conditioning on $J$ and replace the probability that $J$ equals $j$ with $p_j$, we get $F(x)$, which is what we require.
– We first sample $J$ from the categorical distribution $(p_1,\dots,p_k)$, then sample $X$ from $F_J$
– $X$ has the distribution defined by $F$:
$P(X\le x)=\sum_j P(X\le x|J=j)P(J=j)=\sum_j F_j(x)p_j=F(x)$

An example of sampling through the method of composition: suppose that we want to sample from a density that is somewhat irregular and looks like the following.

$f(x) = 0.5+2\cdot (1-0.5)x$

It is defined from zero to one, it takes the value $0.5$ at zero and $1.5$ at one, and it is a straight line in between. This is a PDF, so we require the area under this line to be one, which it is. So it is the PDF of a continuous random variable supported on the interval from zero to one.

One way to sample from it is to recognize that it is a composition of two distributions. We first sample which distribution we want to sample from, either the first or the second, and then proceed to sample from that distribution. The code to implement this is actually pretty easy. We can study this code.
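The code referred to is not included in the transcript. One possible decomposition, assumed here for illustration, reads $f(x)=0.5+x$ as $0.5\cdot 1 + 0.5\cdot 2x$: with probability $0.5$ sample from the uniform density, and with probability $0.5$ sample from the density $2x$, whose CDF is $x^2$ and whose inverse is $\sqrt u$. The function name `sample_composition` is hypothetical.

```python
import math
import random

def sample_composition(rng):
    """Sample from f(x) = 0.5 + x on [0, 1], read as the mixture
    0.5 * Uniform(0, 1) + 0.5 * (density 2x)."""
    if rng.random() < 0.5:
        return rng.random()          # component 1: uniform density
    return math.sqrt(rng.random())   # component 2: CDF x^2, inverse sqrt(u)

rng = random.Random(3)               # fixed seed so the run is repeatable
n = 100_000
xs = [sample_composition(rng) for _ in range(n)]

mean = sum(xs) / n                                # exact mean is 7/12
frac_below_half = sum(x <= 0.5 for x in xs) / n   # exact F(0.5) = 0.375
print(mean, frac_below_half)
```

The exact values quoted in the comments follow from $F(x)=0.5x+0.5x^2$ and $\int_0^1 x(0.5+x)\,dx=7/12$.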



Special Properties

There are other ways to sample a random variable, and these rely on the properties of the random variable.

We represent the random variable $X$ in terms of other variables that are easily generated. For example:

– Sample from **binomial distribution** as the sum of i.i.d. Bernoulli trials
– Sample from **Poisson distribution** as the number of i.i.d. realizations of an exponential distribution before total time exceeds unit time
– Sample from **Erlang-$k$ distribution** as the total time of $k$ i.i.d. realizations of an exponential distribution
– Sample from **beta distribution** $B(k,n+1-k)$ as the $k^{\rm th}$ smallest number in $n$ numbers uniformly distributed in $[0,1]$

Let us talk about sampling from a Poisson distribution as the number of i.i.d. realizations of an exponential distribution before we reach a certain amount of time.

The reason is that a Poisson distribution is the distribution of the number of events of a Poisson process. If we put intervals with exponentially distributed lengths head to tail until we get from one end of the unit interval to the other, and count the number of small intervals that fit inside, that number follows a Poisson distribution.

Similarly, there is the Erlang distribution, which people normally use to model phone calls. We can sample from an Erlang distribution as the total time of $k$ i.i.d. realizations of an exponential distribution. Here is an example in which we sample from a Poisson distribution according to this property: $X$ is a Poisson sample constructed from exponential distributions and $Y$ is a sample from the Poisson distribution directly. If we make the quantile plot of those two, we find that they are pretty similar, so they follow the same type of distribution.
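The constructions for the Poisson and Erlang distributions can be sketched as follows. This is not the lecture's code; the rate `lam = 3.0`, the sample size, and the function names are assumptions. Since the Python standard library has no direct Poisson sampler, the sketch checks the constructed sample against known moments (mean and variance both equal to $\lambda$, and $P(N=0)=e^{-\lambda}$) instead of a quantile plot against a direct sample.

```python
import math
import random

def sample_poisson(lam, rng):
    """Count how many exponential(lam) inter-arrival times fit in unit time."""
    count, total = 0, rng.expovariate(lam)
    while total <= 1.0:
        count += 1
        total += rng.expovariate(lam)
    return count

def sample_erlang(k, lam, rng):
    """Erlang-k sample: total time of k i.i.d. exponential(lam) intervals."""
    return sum(rng.expovariate(lam) for _ in range(k))

lam = 3.0                          # hypothetical rate
rng = random.Random(4)             # fixed seed so the run is repeatable
n = 100_000
counts = [sample_poisson(lam, rng) for _ in range(n)]

mean = sum(counts) / n                              # Poisson mean is lam
var = sum((c - mean) ** 2 for c in counts) / n      # Poisson variance is lam
p0 = counts.count(0) / n                            # P(N = 0) = exp(-lam)
erlang_mean = sum(sample_erlang(2, lam, rng) for _ in range(n)) / n  # k / lam
print(mean, var, p0, math.exp(-lam), erlang_mean)
```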




There are also other methods to sample from a probability distribution, which we will talk about later. For example, you might have already heard of the Gibbs sampler, or we might use something called the Metropolis-Hastings sampler, which is very similar to the acceptance-rejection sampling that we talked about before.

In order to understand those two ways of sampling, we first need to understand some concepts about stochastic processes, because both depend on the asymptotic properties of a stochastic process. Meaning that we run a random process again and again until it loses all memory of where it started, and at that time we know that the thing that we sample, or some statistic of it, follows the distribution that we are interested in.

We have talked about many different ways of sampling from probability distributions. An even more important thing is that, if we want to work with probability, we first want to know which probability distribution we should choose. Is it a random variable with an exponential distribution, a Poisson distribution, or some other distribution? This involves statistical testing. We have a hypothesis, which is that a random variable follows some distribution, and we want to test whether it is more or less likely that the random variable has this distribution, and reject the unlikely cases.

Here is an example of hypothesis testing. Hypothesis testing involves the following concepts.

– **belief/null hypothesis**: a coin is fair
– **experiment**: toss the coin 10 times, resulting in a **sample** of 10 tosses
– **test statistic**: number of heads.
– **significance level**: probability of rejecting the hypothesis/belief when it is true
– **critical region/region of rejection**: reject the hypothesis when the statistic falls into this region.

In order to conduct hypothesis testing, we first need to have a belief about what we have here; without a belief we cannot do anything. We also call this belief the null hypothesis. For example, we may want to test whether a coin is a fair coin, meaning that we get approximately the same number of heads and tails, or, to say it another way, that the proportions of heads and tails become equal as the sample size tends to infinity.

Suppose that this is our hypothesis. After we have a hypothesis, we proceed to conduct an experiment. In the experiment we toss the coin ten times; it could be ten times or twenty times, but not an infinite number of times. After this experiment we get a sample, which is ten tosses of a coin, like head, tail, head, tail, something like that. After we get this sample, we want to come up with some test statistic. One intuitive way of finding out whether the sample is more or less likely is to find the probability of getting exactly this sample, but it will not work.

The intuitive idea is that the higher the probability of our sample of ten tosses, the more likely the hypothesis is true. That will not work, because under the null hypothesis that we have a fair coin, the probability of getting any particular sample of ten tosses is $0.5$ to the power of ten, the same for every sample. There is another way, which involves a test statistic.

If we say our test statistic is the total number of heads in the ten tosses, this looks more promising. If the coin is fair, then we are likely to get five heads and five tails, or four heads and six tails, or six heads and four tails, and it is very unlikely for us to get one head or two heads, or nine heads or ten heads.

The reason is that the number of heads follows a binomial distribution. We know that the middle area is pretty likely and the two tail areas are pretty unlikely. So we have found a promising test statistic. The important thing about this test statistic is that we know its probability distribution under the assumption of the null hypothesis, which is the binomial distribution.

We can proceed to reason about whether the null hypothesis is more likely or less likely. As we said, if the statistic falls in either tail area, the hypothesis is less likely. Corresponding to each significance level we always get some critical region, or region of rejection. If the number of heads falls in this region, we reject our null hypothesis that the coin is fair. Otherwise we say that the hypothesis is compatible with our sample, or our sample is compatible with the hypothesis. The way that we choose the significance level is arbitrary. We could choose, for example, $0.1$ or $0.001$, and corresponding to different significance levels we have different critical regions.

In doing a statistical test, we always have the risk of either rejecting the hypothesis when it is correct or accepting the hypothesis when it is wrong. For example, when the hypothesis is correct and the coin is indeed fair, the possibility of landing in the region of rejection always exists, and in that case we reject a correct hypothesis. In addition, if the hypothesis that the coin is fair is wrong, it is still possible for the number of heads to fall outside the region of rejection, and then we have to accept the hypothesis as compatible with the sample although it is wrong.
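The coin example can be made concrete with the binomial distribution. The sketch below is an illustration under stated assumptions: it uses a symmetric two-sided region of rejection, a significance level of $0.05$, and a hypothetical biased coin with head probability $0.8$ to show the second kind of error. None of these specific choices come from the lecture, and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(n, k, p=0.5):
    """Probability of exactly k heads in n tosses with head probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def critical_region(n, alpha):
    """Largest symmetric two-sided region {0..c} and {n-c..n} whose probability
    under the fair-coin hypothesis does not exceed alpha."""
    region, prob = [], 0.0
    for c in range(n // 2):
        extra = binomial_pmf(n, c) + binomial_pmf(n, n - c)
        if prob + extra > alpha:
            break
        prob += extra
        region += [c, n - c]
    return sorted(region), prob

# Region of rejection for 10 tosses, and its probability when the coin is fair
region, type_one_error = critical_region(10, 0.05)
print(region, type_one_error)

# Probability of NOT rejecting when the coin is biased (hypothetical p = 0.8)
type_two_error = sum(binomial_pmf(10, k, 0.8) for k in range(11) if k not in region)
print(type_two_error)
```

Because the number of heads is discrete, the probability of the region under the null hypothesis generally falls below the chosen significance level instead of matching it exactly.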

We are dealing with probability here, and so far I have just reviewed hypothesis testing. The actual thing that I want to talk about is that, since we are working with probability distributions, we always care about how well a sample fits a probability distribution. We have a sample, and we want to know whether the probability distribution is a good fit or a bad fit to the sample. This is a standard Gaussian distribution. I randomly sampled ten points from a Gaussian distribution with mean zero and standard deviation one, which is the same distribution as this one. We can say that the sample represented by the red dots looks pretty compatible with this distribution. We have made another sample from a normal distribution with mean one and standard deviation $0.5$, which is represented by the blue bars, and we can say that the sample represented by the blue bars is a bad fit to the standard normal distribution here.

We can talk about different things. The first is the parametric goodness-of-fit test: suppose we know that our sample is from a normal distribution; then the question is whether or not we have the right mean value and whether or not we have the right variance or standard deviation of the distribution. This involves parameters, so we call it a parametric test. As in our example of testing whether a coin is fair, one important thing is that, in order to use a test statistic, we want to know its probability density function or cumulative distribution function. For example, if the test statistic is the average value, then we first find the PDF or CDF of this test statistic before we can set a significance level and find the critical region, or region of rejection.

Here, for example, if we know that our sample is from a Gaussian distribution, and if we know the variance of this Gaussian distribution, then our estimator, which is … $X_1$ to $X_N$ are random variables. Our estimator is some function of those random variables, thus it is also a random variable. Whenever we talk about a random variable, we can talk about its distribution, the probability distribution of this random variable.

Let’s say, since we know that those $X_i$ are Gaussian random variables, and there’s a property of Gaussian random variables, which is that a linear combination of Gaussian random variables is also a Gaussian random variable, we know that this one, $\bar{X}$, is a Gaussian random variable. The mean of this Gaussian random variable is $E[\bar{X}] = E\left[\frac{1}{N}\sum_{i=1}^{N} X_i\right]$. We first take this $\frac{1}{N}$ out and then we take the summation sign out of the expectation, so we get $\frac{1}{N}\sum_{i=1}^{N} E[X_i]$, the sum of the means of the $X_i$. Since we know that all of those are samples of the same distribution with mean $\mu$, this is going to be $\frac{1}{N}$ times $N$ times $\mu$, so that is going to be $\mu$.

That is the average value of $\bar{X}$. Similarly we can find the variance of $\bar{X}$. It’s going to be more complicated, but I think it is going to be this one. This means that $\bar{X}$ is going to follow a normal distribution with … I just write it this way: $\bar{X}$ follows a normal distribution with this mean and this variance. We can actually plot the PDF of $\bar{X}$. For example, $\bar{X}$ is a random variable, so it can be anywhere. If it’s here then it’s more likely.
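The board work for this step is not in the transcript, so the following is a reconstruction of the standard result rather than the lecturer's exact lines. It assumes that the $X_i$ are independent draws from the same Gaussian distribution, and the symbols $\mu$ (mean) and $\sigma^2$ (variance) are notation chosen here:

$$E[\bar{X}] = E\left[\frac{1}{N}\sum_{i=1}^{N} X_i\right] = \frac{1}{N}\sum_{i=1}^{N} E[X_i] = \frac{1}{N} \cdot N \cdot \mu = \mu$$

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) = \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}(X_i) = \frac{1}{N^2} \cdot N \cdot \sigma^2 = \frac{\sigma^2}{N}$$

The second equality in the variance line uses independence, which is what lets the variance of the sum split into the sum of the variances. Under these assumptions $\bar{X} \sim \mathcal{N}(\mu, \sigma^2/N)$, so the PDF of $\bar{X}$ is a bell curve centered at $\mu$ that narrows as $N$ grows.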

Goodness-of-fit test

Similarly, suppose that we do not know the standard deviation of the Gaussian random variable from which we get our sample. Then one idea is to normalize the deviation of $\bar{X}$ from the mean by dividing it by the standard deviation estimated from the sample. As a result we no longer get a Gaussian random variable; we get a t statistic, which obeys another probability density function. This is the parametric goodness-of-fit test, which assumes that we know the underlying probability distribution but need to fix some parameters, and we want to check whether the parameters are compatible with the sample. Here we can also talk about nonparametric goodness of fit, meaning that we want to see whether a sample is compatible with a distribution. There are two ways to do this. One is what we call the Kolmogorov test, which is based on …
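The normalization for the unknown-standard-deviation case can be sketched as follows. This is the usual one-sample t statistic, written under the assumption that this is the statistic meant here. The function name `t_statistic`, the hypothesized mean `mu0`, and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """Deviation of the sample mean from mu0, in units of its estimated standard error.

    statistics.stdev uses the N - 1 denominator, which is the usual
    convention for the t statistic.
    """
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

# Hypothetical data: ten measurements, tested against a mean of zero.
data = [0.1, -0.4, 1.2, 0.3, -0.8, 0.5, -0.2, 0.9, -1.1, 0.0]
print(f"t = {t_statistic(data, 0.0):.3f}")
```

If the sample really is Gaussian with mean `mu0`, this statistic follows a Student t distribution with $N - 1$ degrees of freedom, and the critical region is read off from that distribution instead of the Gaussian.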

If we plot the cumulative distribution function of random variable $X$ against the cumulative distribution function of random variable $Y$, it could be something like this. If they are the same distribution, then we get a straight line here. The deviation of this plot from the straight line gives us some indication of the extent to which those two distributions are different. The statistic that is used in the Kolmogorov test is actually the supremum, the maximum deviation of the curve, that is, the CDF of $X$ plotted against the CDF of $Y$, from the diagonal line $y = x$. The interesting thing is that, as the sample size tends to infinity, we actually know the underlying form of this Kolmogorov distribution.
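A minimal sketch of this statistic for two samples, using empirical CDFs in place of the true ones. The function name `ks_two_sample` and the sample sizes are illustrative choices. Both empirical CDFs are step functions that only change at data points, so it is enough to check the gap at every observed value.

```python
import random
from bisect import bisect_right

def ks_two_sample(xs, ys):
    """Largest vertical gap between the empirical CDFs of two samples."""
    xs, ys = sorted(xs), sorted(ys)

    def ecdf(sorted_data, t):
        # Fraction of data points less than or equal to t.
        return bisect_right(sorted_data, t) / len(sorted_data)

    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in xs + ys)

rng = random.Random(0)  # fixed seed, chosen arbitrarily
same_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(1.0, 0.5) for _ in range(200)]
print("same distribution:     ", ks_two_sample(same_a, same_b))
print("different distribution:", ks_two_sample(same_a, shifted))
```

The statistic is 0 when the two empirical CDFs coincide and 1 when the samples do not overlap at all. Deciding whether a value in between is too large requires the Kolmogorov distribution mentioned above.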

This is how we use the Kolmogorov test to check whether two random variables obey the same distribution, that is, whether two samples come from the same distribution. Another test that we talk about is the so-called chi-square goodness-of-fit test. That test uses a chi-square statistic, which obeys a chi-square distribution. The chi-square distribution is the distribution of the sum of the squares of independent standard normal random variables. We have $X_1$ to $X_N$, which are standard normal random variables, meaning that each one is a Gaussian random variable with mean zero and standard deviation 1. If we sum the squares of all of those random variables together, we get a chi-square distribution whose number of degrees of freedom is $N$. The reason that we can use the chi-square test to test goodness of fit for many distributions is that, first, we can use it to test whether a sample fits a binomial distribution. If we can use it to test goodness of fit of a sample to a binomial distribution, then we can extend the result to the multinomial distribution.
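The construction of the chi-square distribution can be checked numerically with a short simulation. This is only a sketch: the number of degrees of freedom (5), the number of draws, and the seed are arbitrary choices. A known property of the chi-square distribution with $N$ degrees of freedom is that its mean is $N$, so the sample mean of the draws should land close to that.

```python
import random

def chi_square_draw(n, rng):
    """One draw: sum of the squares of n independent standard normal variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 5                   # degrees of freedom, illustrative
draws = [chi_square_draw(n, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(f"sample mean = {sample_mean:.3f}, theoretical mean = {n}")
```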

When we work with a continuous distribution, what we do is cut the real line into intervals, into bins. Ultimately we end up working with either binomial or multinomial distributions. That is the chi-square goodness-of-fit test. Also, when we work with different distributions, we need to estimate the parameters. We first talked about testing whether the parameters are the right parameters and whether the distributions are the right distributions. Now we talk about how to estimate the parameters, and there are many different ways to do so.
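The binning idea can be sketched with Pearson's chi-square statistic, which compares the observed count in each bin with the count expected under the hypothesized distribution. The function names, the bin edges, and the data here are illustrative assumptions, and the hypothesized distribution is taken to be the standard normal.

```python
from math import erf, inf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a Gaussian distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def chi_square_statistic(sample, edges, cdf):
    """Sum over bins of (observed - expected)^2 / expected.

    The bins are (-inf, e0], (e0, e1], ..., (e_last, inf), so they
    cover the whole real line.
    """
    bounds = [-inf] + list(edges) + [inf]
    n = len(sample)
    stat = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        observed = sum(1 for x in sample if lo < x <= hi)
        expected = n * (cdf(hi) - cdf(lo))
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical data and bin edges.
data = [-1.3, -0.7, -0.2, 0.1, 0.4, 0.6, 0.9, 1.1, 1.8, -0.5]
print(chi_square_statistic(data, [-1.0, 0.0, 1.0], normal_cdf))
```

When no parameters are estimated from the data, the statistic is compared with a chi-square distribution whose number of degrees of freedom is the number of bins minus one. The approximation is only reliable when the expected count in each bin is not too small.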

Parameter Estimation with Maximum Likelihood

One way that has been used quite often is maximum likelihood estimation. In order to make the maximum likelihood estimate of the parameters, we first write down the likelihood, which is the probability of the sample conditioned on the parameters, and then we maximize the likelihood. In order to maximize it, we normally take the derivative of the likelihood with respect to the parameters, supposing that the likelihood is a differentiable function of the parameters. Then we set this derivative to zero and solve for the parameters. What we know is that the maximum likelihood estimator is asymptotically normal, meaning that when the sample size goes to infinity, its distribution becomes a normal distribution. It is also an efficient estimator. The estimator is a random variable, as we saw before, so it has a variance, and we want to find an estimator that has smaller variance than other estimators.
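As a worked case of this recipe, take a Gaussian sample, which is a choice made here for illustration. Setting the derivatives of the log-likelihood with respect to the mean and the standard deviation to zero gives closed-form answers: the sample mean, and the root of the mean squared deviation with $N$ (not $N - 1$) in the denominator. The function names below are made up for this sketch.

```python
from math import log, pi, sqrt

def gaussian_log_likelihood(sample, mu, sigma):
    """Log of the probability density of the sample given mu and sigma."""
    n = len(sample)
    squared_dev = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * log(2.0 * pi * sigma**2) - squared_dev / (2.0 * sigma**2)

def gaussian_mle(sample):
    """Parameters at which the derivatives of the log-likelihood vanish."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma_hat = sqrt(sum((x - mu_hat) ** 2 for x in sample) / n)
    return mu_hat, sigma_hat

# Hypothetical data.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat, sigma_hat = gaussian_mle(data)
print(mu_hat, sigma_hat)
# Moving away from the estimate should not increase the log-likelihood.
print(gaussian_log_likelihood(data, mu_hat, sigma_hat))
print(gaussian_log_likelihood(data, mu_hat + 0.5, sigma_hat))
```

Maximizing the log-likelihood gives the same parameters as maximizing the likelihood, because the logarithm is monotonic, and it turns the product over sample points into a sum.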

What we know is that the maximum likelihood estimator, which is a random variable, asymptotically has the minimal variance among many other similar estimators. That is what I wanted to talk about in this lecture. In the next lecture we’ll proceed to talk about different stochastic processes, as well as how we sample from those stochastic processes.


Sample from Probability Distributions [08/10/2015]