# Wrap up [11/19]

A classic example of how people belonging to different professions solve the same problem in their own distinctive ways. The task is to paint 4 walls of the room given a bucket of paint, which has sufficient amount of paint just to paint two walls of the room. This task was given to an Engineer, a Physicist and a Mathematician. Now we see how each of them goes about solving this problem.

The Engineer had painted two out of four walls and as expected he had emptied the bucket of paint, followed by the Physicist’s turn to paint the walls. The Physicist never started to paint the walls as he was trying to figure out the calculation of how best to use the paint bucket so as to paint all 4 walls of the room. However, the Mathematician had painted all 4 walls of the room and surprisingly also had the bucket of paint intact. Now let us reason out the approaches each of them had taken in painting the walls.

The Engineer never really thought about if the given quantity of paint would be enough to paint all 4 walls of the room. So he went about painting the walls until the paint was exhausted. As we saw earlier, the physicist was taking time to figure out the calculations before he actually started painting the walls. The reason being, a physician would always work towards computing the laws of the system and hence a lot of thought is required before solving the problem. On the other hand, the mathematician just painted the rational numbers on all 4 walls (countable infinite). This example shows that different people deal with problems in different ways.

The dataset that tracks the location of the people is useful in many aspects. One of the papers is to model spreading of epidemics like Malaria through human movement. This study focuses on human movement from poor employment regions to rich employment regions and vice versa, as people tend to reside at places that are less expensive i.e., poor employment regions. So, the stochastic modeling of population mobility will help us identifying the cause of Malaria diffusion in social network.

Similarly, another paper deals with “Poverty Analysis in Senegal” wherein the information flow in the network will play a significant role in determining the poverty of Senegal. The different areas in Senegal are represented as nodes of a virtual network. If a particular area is poor, then that node in the network would be least visited. So, we can eradicate poverty and also identify areas that are poor by implementing better models of information flow in the network. The researcher has used the Google page rank approach in ranking the poverties of different cities in Senegal.
Consequently, the price variation of Millet was captured in Senegal using a satellite map of production. The researcher again focused on information flow and deduced that the Millet would be fairly priced in all regions of the country only if information flow is handled well.

The poverty is a multidimensional entity wherein poverty can be measured in terms of wealth, education, healthcare and so on. So, we need to count in a number of factors to determine poverty. To find answers to these factors, a census must be conducted across all the places at Senegal. This involves a large number of social resources like people, time, travel, cost etc. Even then, it is very hard to cover the entire population or vast population of the country by conducting surveys. In order to overcome the shortcomings of surveys, we can use mobile phone dataset as a suitable replacement. The reason being, in this present generation almost 96% of the world uses a mobile phone and the connectivity of mobile phones have reached to such an extent that even people residing in remote villages could be reached. This aids in collecting survey results at a very fine level. The important factors in mobile phone datasets are location/mobility traces of people, interaction of cities/communities amongst others.

Another paper that details the “Survey results on mobile phone datasets” focuses on the fact that inferences drawn must be generalized to larger extent of society and less use of social resources. The researchers gather the Call Data Records (CDR’s) – How, when and with whom communication happened that includes message and call data.
Based on these CDR’s, a social network was constructed with nodes, as people and a link exist between two people if they have reciprocity in communication between them. Here, we can either implement a directional link to indicate source as the caller and destination as callee or associate a weight to the unidirectional link indicating the number of times communication happened between two entities.

The researchers used the mobility traces of people to determine the frequent locations visited by them. Also, the focus was on information diffusion in the network to represent problems such as spreading of epidemics, spreading of viruses through Bluetooth or multimedia messages, viral marketing amongst others. The behavior of a community was studied based on the digital signature of a city or a group of people. It was noticed that the people belong to the same community tend to communicate amongst themselves rather than communicating with people outside their community. This also brings in the influence of ethnicity in the population behavior.
The Gravity law states that the probability that two people are connected decreases with increase in spatial distance separating them. Secondly, the duration of call increases with increase in separating spatial distance. Both these points are valid only until a threshold value is reached in distance separating them. After which, the values become constant or very less changes are displayed.

Another interesting observation made in the above paper was duration for people to be in phone contact. As per the researchers, on an average people relocate every 7 years and when relocated they get to meet new people and henceforth-new contacts will be made. In order to maintain the contacts in the phone, some of the old unused contacts would be deleted.
Now we see how people benefit from contacting the weak ties for employment opportunities. The strong ties (people with whom we communicate on a daily basis) know pretty much the same information/contacts and hence not much benefit would be seen in terms of new job opportunities. However, when weak ties are contacted the network grows beyond the scope of know information and chances are high that relatively more job opportunities can be explored.

The paper also talks about the usage of social networks in disasters. If an emergency situation occurs (bomb blast/plane crash/earthquake), it is noticed that the eyewitnesses are contacted immediately after these events. Also, by analyzing the social amplifiers (nodes with highest degree) and its immediate neighbors it is possible to detect the emergencies. However, work has to be done in predicting the emergencies by using this social network.

By studying the mobility traces of people, it helps in determining the economy of the country. The more mobile the better economy and few larger airtime purchases indicate better economy. Consequently, the mobility of people helps in Urban Planning (improving the roads in developing countries) and also in better planning of transportation network to ease traffic congestion.

In spite of all the above benefits from mobile phone dataset, the privacy issues are often challenging. It is important for the service providers to not reveal the personal information of users like phoning number and age. Hence, suitable anonymization techniques must be employed to preserve the identity of the user.

Dwelling into the social dynamics aspect, the models such as flocking model, language model, culture model are used where in the impact of flock movement is evaluated on real-time problems. For example, the herd movement of people impacting the stock prices in stock market. Secondly, the researchers talk about interaction of particle systems to check if a generalizable claim can be made on the observed pattern across all datasets.
Netlogo is another language used primarily for simulation, used for generating game theory model in the formation of network and to facilitate information exchange so as to maintain fairness in the social network.

Urban dynamics is one interesting topic in social dynamics field. There are several good textbooks on this topic such as “Cities and the Complexity” and Urban Dynamics. Researchers used agent-based model to simulate the movement of people, resources, populations and so on and got some interesting conclusions. There are also some soft wares for simulating urban dynamics. One example is UrbanSim, where the urban dynamic is simulated at the agent and individual level. SimCity, which is a famous game, is an interactive simulation of urban dynamics.

Stochastic process can be also used to model the dynamics in system biology and chemical reaction. By discretizing the time, we can make inference about the concentrations of different chemicals we cannot observe. One interesting thing is the similar methodology here can be used to study the asymptotic behavior of our social systems because there are simple corresponding relationships between people of different and different chemicals.

Financial Modeling is also a field full of stochastic process. One famous process is Brownian process. Brownian process is continuous but does not have derivative. This means that the increment is always small but when the time interval goes to zero, the change will be dramatic. An advanced process named Levy process, which is also widely used in this field, is a combination of Brownian motion and jump process. In contrast with the previous dynamic model, there are currently no exact inference algorithms for these processes. The common tool is simulation.

The most used tools in this course for inference in dynamic modeling is variational inference method, which has strong relationship with the law of large numbers and the large deviation theory.

# Exponential Random Graph Model [10/13]

Le Yang, Harshith Kumar Ramadev and Karthik Kakarla Nagendra Prakash. October 13, 2015

### Introduction to Dynamic Networks

With these dynamic networks, what we can do?

1. Summarizing: What are the characteristics of a dynamic network data? For example, the triangle closure property says if A and B are friends and B and C are friends, then A and C are friends. This property can be summarized from different dataset including some dataset mentioned above.
2. Modeling: how to modeling the dynamic networks? There are two ways: a) from the perspective of economists and social scientists; b) from the perspective of physicists. Both perspectives will be discussed later in details.
3. Prediction: One important objective of research on dynamic network is to predict what will happen in the future. For example, in political relationship dataset, we want to know if one nation will support another nation or fight with it.

Network formation with game theory node/agent based modeling (economists and social scientists)

In this perspective, each node in network represents the individual people. Each people has utility function which they want to maximize. For example, people want to maximize their financial benefit or productivity. More importantly, the utility function is related with not only the resources of the individual but also the properties of the other edge through the edges. For example, people may cost money in forming or maintaining a link while gains profit in having connections to others.

One question is how can the individual agents build edges and get rid of edges to maximize the utilities of themselves. Once we have utility function for each individual, we can think of what network structure will emerge as a equilibrium of this game and what dynamic do we have in network evolution. From the overall perspective, another question is whether or not we have a way to design overall policy over the dynamic network so that the overall utility function can be maximized.

Network formation with tie based modeling (physicists)

First model is random graph model. In this model, a network have parameters N and P which means there are $n \choose 2$ potential edges and each of them might appear with probability P. Obviously, the average number of edges is $N \times P$. Some questions we may ask about this random graph is as follows:

• What is the average path length from one node to another node?
• What is the average distance?
• Whether there will be a lot of small clusters or only few large clusters?

Second model is so-called small world model. First we construct a rigor model where every node is connected with its full neighbors. Then we randomly rewire those edges from one node to any other randomly sampled node in the world. One interesting feature of this kind of random graph is the shortest path from any node to any other node is going to be the logarithm of the size of the world, i.e. the number of nodes in this network.

The third model is preferential attachment model constructed by Barabasi to talk about the fat tail distribution of the edges. The fat tail distribution means some people have a lot of connection and most of the people have very few connection. His idea is to start from some kernel and then grow the network by adding in new nodes. At each time a new node is added, it will establish connection with existing nodes according to the number of degrees. Obviously, rich get richer.

### Markov networks exhibiting Maximum Entropy property

The underlying idea in this part of the discussion is to form a dynamic network, which primarily means constructing a network by adding new edges or removing existing edges. Here we only consider the current state of the system and details about the previous system states are not taken into consideration. This process of starting from an initial state and transforming the network by adding/deleting edges to reach a state of equilibrium/asymptotic distribution demonstrates the principle of Markov networks exhibiting Maximum Entropy property.

When the features/sufficient statistics are known and also the average value of some functions of random variables are given, the probability distribution that best depicts the current state of the system would be having the largest entropy. This statement again is based on Markov property. The corresponding equation that shows this relation is given below:

Maximize $H(X) = – \int f(x)logf(x)d(x)$,

Subject to: $E(f_i(X)) = a_i$ for $i =1,\dots,n$.

Given below are the set of well-known family of probability distributions and their properties:

 Distribution Parameters θ Natural parameters η Base distribution h(x) Sufficient statistic T(x) Log-partition A(θ) Bernoulli p log (p/(1−p)) 1 x -log(1-p) Binomial p log (p/(1−p)) nCx x -nlog(1-p) Poisson λ log λ 1/x! x λ Exponential λ log λ 1 x -log(λ)

Here, we can brief about the Partition function i.e., $A(\theta)$. This function describes the statistical properties of the system in equilibrium state or asymptotic state. Also, we can understand the importance of base distribution i.e., $h(x)$ which gains significance when we are approximating a probability distribution with any other probability distribution of exponential family.

The equation that describes the relation of main statistics, sufficient statistics, Base function and Partition function on a given probability distribution is shown below:

$f_X(x;\theta)=h(x)exp(\eta(\theta)⋅T(x)−A(\theta))$

The directive graph with $n$ nodes and $n \choose 2$ edges which is used to analyze the data about social networks and other real world systems constitute the Exponential Random Graph Model (ERG’s). The edges are tied with average values of features/statistics.

The underlying question in the concept of ERG’s is, “What is the probability distribution of different graphs according to exponential distribution defined by exponential graph model?”.

The answer for this question can be represented by an equation as follows:

$P(X=x|\theta)=P_{\theta}(x)=c \times exp(\theta_1 z_1(x)+….+\theta_p z_p(x))$

The maximum entropy distribution of $p$ features/statistics/invariants is represented by $E(Z_1(x))….. E(Z_p(x))$.

Note that the variable $c$ in the above equation represents the “Exponential of partition function”.

The Sufficient statistic for a family of probability distributions is defined as the statistic defined for a probability distribution which gives all the information as given by a sample from which this statistic was calculated.

Below are some of the examples of scenarios and their sufficient statistics:

One tied scenarios:

1. a) Reciprocity – If A and B are friends, then B and A are also friends.
$M(x)=\sum_{i<j} x_{ij}*x_{ji}$
2. Birds of same feather flock together.
3. Bernoulli – Measured by density of edges. $\sum_{ij} x_{ij}/ \sum_{ij} 1$

Multiple tied scenarios:

1. Triangles – If there is an edge between points i and j, j and k; then there is also an edge between points i and k. The sufficient statistic for the number of triangles is illustrated as follows: $T(x)=\sum_{i<j<k} x_{ij}*x_{jk}*x_{ki}$.
2. Number of k-stars: the sufficient statistic for k-star scenario is given by:$S_k (x)=\sum_{i \in N} {x_{i+} \choose k}$, where $x_{i+} = \sum_{j} x_{ij}$.

In general, given a sufficient statistic and probability distribution, result obtained is a exponential random graph (Static Version). i.e., we will proceed to sample different random graphs according to the distribution defined by the features.

Dynamic ERG’s involve dynamic changes to the graph based only on the current state of the graph to attain the equilibrium state. Hence, we can track the dynamics of evolution of network based on limited observations about the network.

We can use the sampling algorithm to define the probability distribution of any sample path in terms of set of events. The sampling algorithm is as follows:

An ERG is represented by a $M \times M$ matrix, where a value of 0 in a cell shows that there is no edge between two vertices and a value of 1 shows that there is an edge between two vertices.

$X(t_m)$ to $X(t_{m+1})$ for $m=0,\dots,M$.

1. Sample next event time according to exponential distribution and update time t←t+Δt, where Δt ~ Exponential(ρ).

X is a function of time and the way to sample this graph is to first sample the next event according to the exponential distribution (maximum entropy property). The time of the event is calculated based on time at which first success event is observed in any experiment.

1. Sample directed pair $(i,j)$ uniformly and update $X_{ij}$ according to Gibbs Sampling.

Each pair can have two states – Edge or no edge. Using Gibbs sampling, we find the probability of having a edge on all (N-1) edges. This probability will be an exponential distribution. Taking the logarithm of this result will fetch us the linear form equation:

$P(X_{ij} = 1 | X \setminus X_{ij}) = \frac{P(X =\Delta^{+}_{ij}x)}{P(X =\Delta^{+}_{ij}x) + P(X =\Delta^{-}_{ij}x)}$

$log \frac{P(X_{ij} = 1 | X \setminus X_{ij})}{P(X_{ij} = 0| X \setminus X_{ij})} = \sum_k \theta_k (z_k(\Delta^{+}_{ij}x) -z_k(\Delta^{-}_{ij}x))$

where $\delta^{+}_{ij}z_k(x) = z_k(\Delta^{+}_{ij}x) -z_k(\Delta^{-}_{ij}x$ is called change statistic and  ‘+’ indicates there exists an edge and ‘-‘ indicates no edge between two vertices.

The equilibrium can be achieved using the Gibbs sampling approach of sampling path estimation.

Also, given the aggregate statistics such as number of people taking different courses and the time they spend in offices every day, we can perform the parameter estimation to each of these sufficient statistics, which will be explained below.

### Parameter estimation

Note that we already have a way to sample either static exponential random graph model or a dynamic exponential random graph model. So let us now discuss how we can estimate the parameters of those models. First let us try to sample a dynamic ERGM. This approach is called Geyer-Thompson approach which is based on Importance sampling.

1. Suppose that we have observations about the edges, we would like to sample the parameters in such a way that the observations are matched.
2. Sample $x(1),…,x(M)$ according to $θ~= (θ~1,…,θ~p)$. We sample this dynamic network model by first making $M$ samples of $X$ according to some initial parameters, where each $X$ is a matrix i.e. each X is an exponential random graph. Here the initial parameters are the ‘p’ features or ‘p’ sufficient statistics of $\theta$.
3. Once we have the initial sample, we iteratively improve the parameters according to the Newton’s method (Newton-Raphson method).
4. Now take another sample of x and repeat this process until z is maximized. After we have a good enough estimation of theta then we will repeat this process, normally we will need to take another sample because we have many local minima and maxima.

Details about the 3rd step: Newton’s method is a process to find the roots of functions whose graph crosses or just touches the x-axis. Consider a curve f(x) which is shown below. We would like to find the root of f(x) i.e. find the value of x for which f(x) will be equal to 0. We first assign a value to x say $x = x_0$ which is sufficiently close to the root. We then apply first order approximation of $f(x)$ at x = x We get a new value of x say $x_1$. Now we apply the first order approximation of f(x) at $x = x_1$, to find $x_2$. This procedure continues until you converge to a single value which is the root of the function. This is illustrated in the below example. Note that The Newton-Raphson method is just the multidimensional equivalent of this one dimensional Newton’s method.

Example:

1. We initially consider xo to be 6 and apply first order approximation of f(x) at x = 6. We will get the value of x1 as 3.33.
2. Now we apply first order approximation of f(x) at x = 3.33. We get x2 to be equal to 2.27.
3. Now we apply first order approximation to f(x) at x = 2.27. This will give us the value of x3 to be 2.01.
4. Now we apply first order approximation to f(x) at x = 2.01. This will give us the value of x4 to be 2.00. Now if you continue further you will get the same value of x which is 2.00. You can say that the value of x is converged to 2.00 which is the root of f(x).

Let us understand the intuition behind this method: One of the simplest functions one can deal with is a linear function: $f(x)=mx+b$. In particular, if you want the root of a linear function, it’s quite easily figured: $x=−bm$. Now, it is well-known that the tangent line of a function is the “best” linear approximation of a function in the vicinity of its point of tangency:

The first idea of the Newton-Raphson method is that, since it is easy to find the root of a linear function, we pretend that our complicated function is a line, and then find the root of a line, with the hope that the line’s crossing is an excellent approximation to the root we actually need.

Mathematically, if we have the tangent line of $f(x)$ at $x=a$, where ‘a’ is the “starting point”: $f(x) \approx f(a) + f′(a)(x−a)=0$. This is obtained by the Taylor series of the function $f(x)$ by considering only the first order terms. If we want equation in terms of $x$, then $x=a−\frac{f(a)}{f′(a)}$. Let’s call this $x_1$.

As you can see, the blue point corresponding to the approximation is a bit far off, which brings us to the second idea of Newton-Raphson: go for multiple iterations until you the reach the converged point.

As you can see, the new blue point is much nearer to the red point. Mathematically, this corresponds to finding the root of the new tangent line at $x = x_1$: $x_2 = x_1−\frac{f(x_1)}{f′(x_1)}$

This can be written in the form of a matrix. We can keep playing this game (with $x_2, x_3,\cdots, x_n$), up until the point that we find a value where the quantity $f(x_n)\over f′(x_n)$ is “tiny”. We then say that we have converged to an approximation of the root. That is the essence of Newton-Raphson.

The reason we first take a sample is to be able to estimate the sufficient statistics. This is given by $E_\theta(z(x)) \approx [\sum w^{(i)}* z(x^{(i)})]$ for $i = 1, 2, 3, \cdots,m$

This is how we estimate the sufficient statistics. Here $w(m)$: weight of $x_m$ and θ takes value of $\theta^{(g−1)}$ and

$$\begin{eqnarray*} w^{(m)} &=& \exp\left(\sum_{i=1}^p(\theta_i-\tilde\theta_i)z_i(x^{(m)})\right) \bigg / \sum_{m=1}^M \exp\left(\sum_{i=1}^p(\theta_i-\tilde\theta_i)z_i(x^{(m)})\right)\\ D(\theta) &=& \sum_m w^{(m)}z(x^{(m)})z(x^{(m)})^\mbox{T}-\left[\sum_m w^{(m)}z(x^{(m)})\right]\cdot\left[\sum_m w^{(m)}z(x^{(m)})\right]^\mbox{T}\\ \theta^{(g)} &=& \theta^{(g-1)}-D(\theta^{(g-1)})^{-1}\left[\sum_{m=1}^M w^{(m)}z(x^{(m)})-z(x_\mbox{obs})\right]\end{eqnarray*}$$

Here instead of taking the partial derivative of ‘f’ over ‘x’, what we have done is we have used the difference which approximates to the partial derivative. The equations above correspond to the difference between weighted values of the sufficient statistics which should be close to the estimated value of this one under this new sample. By these equations we find the next theta that is closer to the sufficient statistics that we are given. By iterating this we hope that in the end we will be able to get some theta that makes the expected value the empirical average to be close to this given average of the statistics.

Now let us see another approach for parameter estimation which is called Robins-Monro approach based on Metropolis-Hastings sampling. Given below are the steps followed in this approach.

1. We make a sample of X, and then for the sample, we compute the mean value of the features and the statistics of those features.
2. We iteratively sample the next exponential random graph and estimate the next parameters. We continue this process until it converges.
3. We will proceed with several sub-phases and in each sub-phase we use smaller and smaller steps so that we do not over shoot the actual solution.
4. Finally, we need to check the convergence which is estimated in terms of the deviation. Since standard deviations are pretty hard to estimate due to the lack of parameters, we sample a set of X according to our last estimation of θ, then from the sample we estimate standard deviation of different features and thereby the sufficient statistics.

### Goodness of Fit

After estimation of the parameters and estimation of our sample, it is important to check if they are a good fit to our model. In case of social networks, the properties you require your model to have are very dynamic in nature i.e. today you may require your model to fit certain properties and tomorrow you may require it to fit some other properties. Therefore whenever we have a model, we need to consider the goodness of fit. There are two ways to talk about goodness of fit.

1. One way is to talk about a common statistic. Suppose we have quantile-quantile plot for one distribution v/s another distribution and if these plots are within a narrow band, then we say it is a goodness of fit.
2. Another way to work is that, suppose the statistic has a Gaussian distribution, then we can perform chi-square tests and talk about the Gaussian distribution.

### Statnet package in R

As long as the network is static and not dynamic, all the methods we have discussed so far like sampling, parameter estimation and testing goodness of fit could be implemented using the statnet package in R.

Let us see how we can track or estimate dynamic networks. For example, let us consider the Florentine marriage data set which indicates the marriage relationships among different families. This data set has a set of attributes like wealth, population etc. which describe all the families resulting in different covariance for each of them. Edges are formed between two families if there is marriage relationship between them.

As discussed earlier, it is important for us to understand the marriage network and to sample this network. But the only thing we know are the edges that are formed. Considering the random graph model, the only thing we need to know is the number of edges so that we can sample it and fit the model. After we fit the model, we compute the p-value to check whether our model is statistically significant enough.

One of the ways to improve this model is through triangle inequality. Accordingly, if family A form marriage with family B and family B forms marriage with family C, then according to the principle of triangle inequality, family A should most likely form marriage with family C. But if you see the graph of the network, it is obvious that this principle does not hold true here as there is no triangle closure in case of some of the nodes. So we can proceed with model fitting. The below graph with the corresponding R code show the plot of the network in which the size of the node is proportional the wealth of the corresponding families. We find that the families with similar wealth are more likely to be form edges between them.

<pre>plot(flomarriage, vertex.cex=flomarriage %v% ‘wealth’ / 25, main=’Florentine marriage by wealth’)</pre>

We can include this factor in our model so that now model will be a function of both edges and the covariance of the wealth of the families. If you compare the p-value of the previous model (considering the triangle closure principle) with the p-value of this model, we find that p-value has improved from 0.79 to 0.027. We can infer that although this model is not statistically very significant but still there is some truth behind the principle – “Birds of same feather flock together”.

We can also sample the dynamic network by using the Gibbs sampling algorithm where we sample one edge and fix all other edges. We can simulate this network using a model that involves the density of edges as well as wealth and birds of the feather coefficient into consideration. These graphs are only sample graphs and will be different from the original network. We can see that the edges in the simulated graphs do not coincide with the edges in the original graph. The below graphs correspond to the 10 simulations of Florentine marriage network which are computed using the R code below.

flomodel.wealth.sim <- simulate(flomodel.wealth, nsim=10)
for(I in 1: length(flomodel.wealth.sim)) plot(flomodel.wealth.sim[[i]], vertex.cex=flomarriage %v% 'wealth' / 25, main='Florentine marriage simulaion')

In order to talk about the goodness of fit we can think of different statistics and then we can compare the different statistics in the target network with the different statistics in the original network and we see whether those are correct. The goodness of fit computation in R and the corresponding results are given below.

flomodel.wealth.gof <- gof(flomodel.wealth~degree + esp + distance)
plot(flomodel.wealth.gof)

Here we are fitting the family wealth with the degree of the edges and plus some other things. Note that we have many families. Thus we will have a distribution of the wealth of these families based on our graph. We can compare the distributions with the observed distributions. From these results, we can infer that this model is pretty good in simulating the dynamic network we have.