# Exponential Random Graph Model [10/13]

Le Yang, Harshith Kumar Ramadev and Karthik Kakarla Nagendra Prakash. October 13, 2015

### Introduction to Dynamic Networks

With these dynamic networks, what we can do?

1. Summarizing: What are the characteristics of a dynamic network data? For example, the triangle closure property says if A and B are friends and B and C are friends, then A and C are friends. This property can be summarized from different dataset including some dataset mentioned above.
2. Modeling: how to modeling the dynamic networks? There are two ways: a) from the perspective of economists and social scientists; b) from the perspective of physicists. Both perspectives will be discussed later in details.
3. Prediction: One important objective of research on dynamic network is to predict what will happen in the future. For example, in political relationship dataset, we want to know if one nation will support another nation or fight with it.

Network formation with game theory node/agent based modeling (economists and social scientists)

In this perspective, each node in network represents the individual people. Each people has utility function which they want to maximize. For example, people want to maximize their financial benefit or productivity. More importantly, the utility function is related with not only the resources of the individual but also the properties of the other edge through the edges. For example, people may cost money in forming or maintaining a link while gains profit in having connections to others.

One question is how can the individual agents build edges and get rid of edges to maximize the utilities of themselves. Once we have utility function for each individual, we can think of what network structure will emerge as a equilibrium of this game and what dynamic do we have in network evolution. From the overall perspective, another question is whether or not we have a way to design overall policy over the dynamic network so that the overall utility function can be maximized.

Network formation with tie based modeling (physicists)

First model is random graph model. In this model, a network have parameters N and P which means there are $n \choose 2$ potential edges and each of them might appear with probability P. Obviously, the average number of edges is $N \times P$. Some questions we may ask about this random graph is as follows:

• What is the average path length from one node to another node?
• What is the average distance?
• Whether there will be a lot of small clusters or only few large clusters?

Second model is so-called small world model. First we construct a rigor model where every node is connected with its full neighbors. Then we randomly rewire those edges from one node to any other randomly sampled node in the world. One interesting feature of this kind of random graph is the shortest path from any node to any other node is going to be the logarithm of the size of the world, i.e. the number of nodes in this network.

The third model is preferential attachment model constructed by Barabasi to talk about the fat tail distribution of the edges. The fat tail distribution means some people have a lot of connection and most of the people have very few connection. His idea is to start from some kernel and then grow the network by adding in new nodes. At each time a new node is added, it will establish connection with existing nodes according to the number of degrees. Obviously, rich get richer.

### Markov networks exhibiting Maximum Entropy property

The underlying idea in this part of the discussion is to form a dynamic network, which primarily means constructing a network by adding new edges or removing existing edges. Here we only consider the current state of the system and details about the previous system states are not taken into consideration. This process of starting from an initial state and transforming the network by adding/deleting edges to reach a state of equilibrium/asymptotic distribution demonstrates the principle of Markov networks exhibiting Maximum Entropy property.

When the features/sufficient statistics are known and also the average value of some functions of random variables are given, the probability distribution that best depicts the current state of the system would be having the largest entropy. This statement again is based on Markov property. The corresponding equation that shows this relation is given below:

Maximize $H(X) = – \int f(x)logf(x)d(x)$,

Subject to: $E(f_i(X)) = a_i$ for $i =1,\dots,n$.

Given below are the set of well-known family of probability distributions and their properties:

 Distribution Parameters θ Natural parameters η Base distribution h(x) Sufficient statistic T(x) Log-partition A(θ) Bernoulli p log (p/(1−p)) 1 x -log(1-p) Binomial p log (p/(1−p)) nCx x -nlog(1-p) Poisson λ log λ 1/x! x λ Exponential λ log λ 1 x -log(λ)

Here, we can brief about the Partition function i.e., $A(\theta)$. This function describes the statistical properties of the system in equilibrium state or asymptotic state. Also, we can understand the importance of base distribution i.e., $h(x)$ which gains significance when we are approximating a probability distribution with any other probability distribution of exponential family.

The equation that describes the relation of main statistics, sufficient statistics, Base function and Partition function on a given probability distribution is shown below:

$f_X(x;\theta)=h(x)exp(\eta(\theta)⋅T(x)−A(\theta))$

The directive graph with $n$ nodes and $n \choose 2$ edges which is used to analyze the data about social networks and other real world systems constitute the Exponential Random Graph Model (ERG’s). The edges are tied with average values of features/statistics.

The underlying question in the concept of ERG’s is, “What is the probability distribution of different graphs according to exponential distribution defined by exponential graph model?”.

The answer for this question can be represented by an equation as follows:

$P(X=x|\theta)=P_{\theta}(x)=c \times exp(\theta_1 z_1(x)+….+\theta_p z_p(x))$

The maximum entropy distribution of $p$ features/statistics/invariants is represented by $E(Z_1(x))….. E(Z_p(x))$.

Note that the variable $c$ in the above equation represents the “Exponential of partition function”.

The Sufficient statistic for a family of probability distributions is defined as the statistic defined for a probability distribution which gives all the information as given by a sample from which this statistic was calculated.

Below are some of the examples of scenarios and their sufficient statistics:

One tied scenarios:

1. a) Reciprocity – If A and B are friends, then B and A are also friends.
$M(x)=\sum_{i<j} x_{ij}*x_{ji}$
2. Birds of same feather flock together.
3. Bernoulli – Measured by density of edges. $\sum_{ij} x_{ij}/ \sum_{ij} 1$

Multiple tied scenarios:

1. Triangles – If there is an edge between points i and j, j and k; then there is also an edge between points i and k. The sufficient statistic for the number of triangles is illustrated as follows: $T(x)=\sum_{i<j<k} x_{ij}*x_{jk}*x_{ki}$.
2. Number of k-stars: the sufficient statistic for k-star scenario is given by:$S_k (x)=\sum_{i \in N} {x_{i+} \choose k}$, where $x_{i+} = \sum_{j} x_{ij}$.

In general, given a sufficient statistic and probability distribution, result obtained is a exponential random graph (Static Version). i.e., we will proceed to sample different random graphs according to the distribution defined by the features.

Dynamic ERG’s involve dynamic changes to the graph based only on the current state of the graph to attain the equilibrium state. Hence, we can track the dynamics of evolution of network based on limited observations about the network.

We can use the sampling algorithm to define the probability distribution of any sample path in terms of set of events. The sampling algorithm is as follows:

An ERG is represented by a $M \times M$ matrix, where a value of 0 in a cell shows that there is no edge between two vertices and a value of 1 shows that there is an edge between two vertices.

$X(t_m)$ to $X(t_{m+1})$ for $m=0,\dots,M$.

1. Sample next event time according to exponential distribution and update time t←t+Δt, where Δt ~ Exponential(ρ).

X is a function of time and the way to sample this graph is to first sample the next event according to the exponential distribution (maximum entropy property). The time of the event is calculated based on time at which first success event is observed in any experiment.

1. Sample directed pair $(i,j)$ uniformly and update $X_{ij}$ according to Gibbs Sampling.

Each pair can have two states – Edge or no edge. Using Gibbs sampling, we find the probability of having a edge on all (N-1) edges. This probability will be an exponential distribution. Taking the logarithm of this result will fetch us the linear form equation:

$P(X_{ij} = 1 | X \setminus X_{ij}) = \frac{P(X =\Delta^{+}_{ij}x)}{P(X =\Delta^{+}_{ij}x) + P(X =\Delta^{-}_{ij}x)}$

$log \frac{P(X_{ij} = 1 | X \setminus X_{ij})}{P(X_{ij} = 0| X \setminus X_{ij})} = \sum_k \theta_k (z_k(\Delta^{+}_{ij}x) -z_k(\Delta^{-}_{ij}x))$

where $\delta^{+}_{ij}z_k(x) = z_k(\Delta^{+}_{ij}x) -z_k(\Delta^{-}_{ij}x$ is called change statistic and  ‘+’ indicates there exists an edge and ‘-‘ indicates no edge between two vertices.

The equilibrium can be achieved using the Gibbs sampling approach of sampling path estimation.

Also, given the aggregate statistics such as number of people taking different courses and the time they spend in offices every day, we can perform the parameter estimation to each of these sufficient statistics, which will be explained below.

### Parameter estimation

Note that we already have a way to sample either static exponential random graph model or a dynamic exponential random graph model. So let us now discuss how we can estimate the parameters of those models. First let us try to sample a dynamic ERGM. This approach is called Geyer-Thompson approach which is based on Importance sampling.

1. Suppose that we have observations about the edges, we would like to sample the parameters in such a way that the observations are matched.
2. Sample $x(1),…,x(M)$ according to $θ~= (θ~1,…,θ~p)$. We sample this dynamic network model by first making $M$ samples of $X$ according to some initial parameters, where each $X$ is a matrix i.e. each X is an exponential random graph. Here the initial parameters are the ‘p’ features or ‘p’ sufficient statistics of $\theta$.
3. Once we have the initial sample, we iteratively improve the parameters according to the Newton’s method (Newton-Raphson method).
4. Now take another sample of x and repeat this process until z is maximized. After we have a good enough estimation of theta then we will repeat this process, normally we will need to take another sample because we have many local minima and maxima.

Details about the 3rd step: Newton’s method is a process to find the roots of functions whose graph crosses or just touches the x-axis. Consider a curve f(x) which is shown below. We would like to find the root of f(x) i.e. find the value of x for which f(x) will be equal to 0. We first assign a value to x say $x = x_0$ which is sufficiently close to the root. We then apply first order approximation of $f(x)$ at x = x We get a new value of x say $x_1$. Now we apply the first order approximation of f(x) at $x = x_1$, to find $x_2$. This procedure continues until you converge to a single value which is the root of the function. This is illustrated in the below example. Note that The Newton-Raphson method is just the multidimensional equivalent of this one dimensional Newton’s method.

Example:

1. We initially consider xo to be 6 and apply first order approximation of f(x) at x = 6. We will get the value of x1 as 3.33.
2. Now we apply first order approximation of f(x) at x = 3.33. We get x2 to be equal to 2.27.
3. Now we apply first order approximation to f(x) at x = 2.27. This will give us the value of x3 to be 2.01.
4. Now we apply first order approximation to f(x) at x = 2.01. This will give us the value of x4 to be 2.00. Now if you continue further you will get the same value of x which is 2.00. You can say that the value of x is converged to 2.00 which is the root of f(x).

Let us understand the intuition behind this method: One of the simplest functions one can deal with is a linear function: $f(x)=mx+b$. In particular, if you want the root of a linear function, it’s quite easily figured: $x=−bm$. Now, it is well-known that the tangent line of a function is the “best” linear approximation of a function in the vicinity of its point of tangency:

The first idea of the Newton-Raphson method is that, since it is easy to find the root of a linear function, we pretend that our complicated function is a line, and then find the root of a line, with the hope that the line’s crossing is an excellent approximation to the root we actually need.

Mathematically, if we have the tangent line of $f(x)$ at $x=a$, where ‘a’ is the “starting point”: $f(x) \approx f(a) + f′(a)(x−a)=0$. This is obtained by the Taylor series of the function $f(x)$ by considering only the first order terms. If we want equation in terms of $x$, then $x=a−\frac{f(a)}{f′(a)}$. Let’s call this $x_1$.

As you can see, the blue point corresponding to the approximation is a bit far off, which brings us to the second idea of Newton-Raphson: go for multiple iterations until you the reach the converged point.

As you can see, the new blue point is much nearer to the red point. Mathematically, this corresponds to finding the root of the new tangent line at $x = x_1$: $x_2 = x_1−\frac{f(x_1)}{f′(x_1)}$

This can be written in the form of a matrix. We can keep playing this game (with $x_2, x_3,\cdots, x_n$), up until the point that we find a value where the quantity $f(x_n)\over f′(x_n)$ is “tiny”. We then say that we have converged to an approximation of the root. That is the essence of Newton-Raphson.

The reason we first take a sample is to be able to estimate the sufficient statistics. This is given by $E_\theta(z(x)) \approx [\sum w^{(i)}* z(x^{(i)})]$ for $i = 1, 2, 3, \cdots,m$

This is how we estimate the sufficient statistics. Here $w(m)$: weight of $x_m$ and θ takes value of $\theta^{(g−1)}$ and

$$\begin{eqnarray*} w^{(m)} &=& \exp\left(\sum_{i=1}^p(\theta_i-\tilde\theta_i)z_i(x^{(m)})\right) \bigg / \sum_{m=1}^M \exp\left(\sum_{i=1}^p(\theta_i-\tilde\theta_i)z_i(x^{(m)})\right)\\ D(\theta) &=& \sum_m w^{(m)}z(x^{(m)})z(x^{(m)})^\mbox{T}-\left[\sum_m w^{(m)}z(x^{(m)})\right]\cdot\left[\sum_m w^{(m)}z(x^{(m)})\right]^\mbox{T}\\ \theta^{(g)} &=& \theta^{(g-1)}-D(\theta^{(g-1)})^{-1}\left[\sum_{m=1}^M w^{(m)}z(x^{(m)})-z(x_\mbox{obs})\right]\end{eqnarray*}$$

Here instead of taking the partial derivative of ‘f’ over ‘x’, what we have done is we have used the difference which approximates to the partial derivative. The equations above correspond to the difference between weighted values of the sufficient statistics which should be close to the estimated value of this one under this new sample. By these equations we find the next theta that is closer to the sufficient statistics that we are given. By iterating this we hope that in the end we will be able to get some theta that makes the expected value the empirical average to be close to this given average of the statistics.

Now let us see another approach for parameter estimation which is called Robins-Monro approach based on Metropolis-Hastings sampling. Given below are the steps followed in this approach.

1. We make a sample of X, and then for the sample, we compute the mean value of the features and the statistics of those features.
2. We iteratively sample the next exponential random graph and estimate the next parameters. We continue this process until it converges.
3. We will proceed with several sub-phases and in each sub-phase we use smaller and smaller steps so that we do not over shoot the actual solution.
4. Finally, we need to check the convergence which is estimated in terms of the deviation. Since standard deviations are pretty hard to estimate due to the lack of parameters, we sample a set of X according to our last estimation of θ, then from the sample we estimate standard deviation of different features and thereby the sufficient statistics.

### Goodness of Fit

After estimation of the parameters and estimation of our sample, it is important to check if they are a good fit to our model. In case of social networks, the properties you require your model to have are very dynamic in nature i.e. today you may require your model to fit certain properties and tomorrow you may require it to fit some other properties. Therefore whenever we have a model, we need to consider the goodness of fit. There are two ways to talk about goodness of fit.

1. One way is to talk about a common statistic. Suppose we have quantile-quantile plot for one distribution v/s another distribution and if these plots are within a narrow band, then we say it is a goodness of fit.
2. Another way to work is that, suppose the statistic has a Gaussian distribution, then we can perform chi-square tests and talk about the Gaussian distribution.

### Statnet package in R

As long as the network is static and not dynamic, all the methods we have discussed so far like sampling, parameter estimation and testing goodness of fit could be implemented using the statnet package in R.

Let us see how we can track or estimate dynamic networks. For example, let us consider the Florentine marriage data set which indicates the marriage relationships among different families. This data set has a set of attributes like wealth, population etc. which describe all the families resulting in different covariance for each of them. Edges are formed between two families if there is marriage relationship between them.

As discussed earlier, it is important for us to understand the marriage network and to sample this network. But the only thing we know are the edges that are formed. Considering the random graph model, the only thing we need to know is the number of edges so that we can sample it and fit the model. After we fit the model, we compute the p-value to check whether our model is statistically significant enough.

One of the ways to improve this model is through triangle inequality. Accordingly, if family A form marriage with family B and family B forms marriage with family C, then according to the principle of triangle inequality, family A should most likely form marriage with family C. But if you see the graph of the network, it is obvious that this principle does not hold true here as there is no triangle closure in case of some of the nodes. So we can proceed with model fitting. The below graph with the corresponding R code show the plot of the network in which the size of the node is proportional the wealth of the corresponding families. We find that the families with similar wealth are more likely to be form edges between them.

<pre>plot(flomarriage, vertex.cex=flomarriage %v% ‘wealth’ / 25, main=’Florentine marriage by wealth’)</pre>

We can include this factor in our model so that now model will be a function of both edges and the covariance of the wealth of the families. If you compare the p-value of the previous model (considering the triangle closure principle) with the p-value of this model, we find that p-value has improved from 0.79 to 0.027. We can infer that although this model is not statistically very significant but still there is some truth behind the principle – “Birds of same feather flock together”.

We can also sample the dynamic network by using the Gibbs sampling algorithm where we sample one edge and fix all other edges. We can simulate this network using a model that involves the density of edges as well as wealth and birds of the feather coefficient into consideration. These graphs are only sample graphs and will be different from the original network. We can see that the edges in the simulated graphs do not coincide with the edges in the original graph. The below graphs correspond to the 10 simulations of Florentine marriage network which are computed using the R code below.

flomodel.wealth.sim <- simulate(flomodel.wealth, nsim=10)
for(I in 1: length(flomodel.wealth.sim)) plot(flomodel.wealth.sim[[i]], vertex.cex=flomarriage %v% 'wealth' / 25, main='Florentine marriage simulaion')

In order to talk about the goodness of fit we can think of different statistics and then we can compare the different statistics in the target network with the different statistics in the original network and we see whether those are correct. The goodness of fit computation in R and the corresponding results are given below.

flomodel.wealth.gof <- gof(flomodel.wealth~degree + esp + distance)
plot(flomodel.wealth.gof)

Here we are fitting the family wealth with the degree of the edges and plus some other things. Note that we have many families. Thus we will have a distribution of the wealth of these families based on our graph. We can compare the distributions with the observed distributions. From these results, we can infer that this model is pretty good in simulating the dynamic network we have.