Prepared by Hao Fan, Manu Kothegal Shivakumar and Pritika Mehta

In this lecture, we will specific talk about how can we combine agent-based modeling on one hand and vehicle tracking data on other hand to make inference about the states of the roads dynamically.

The resources for us to understand our cities are not only the publication of papers or the books, more importantly are the models, computational tools, and visualization tools. Listed as follow are the open source simulation software.

• MATSim
• TRANSIMS — Transportation Analysis Simulation System
• DynaMIT & MITSIMLab — DynaMIT & MITSIMLab is developed by the civil engineering department at MIT, supported by the department of transportation.
• UrbanSim — UrbanSim is a tool to simulate the migration of people within a city and to simulate the interaction of different industries as well as people. It is pretty useful for people working with cities and working on social dynamics at large-scale.
• SimCity open source — SimCity is a famous computer game, and there are several versions of it. The first version of it was open-sourced, and there is also a java version.

If we work with real world cities, there can be two major problems. The first one is how can we combine those tools to make real time prediction about what will happen, the second one is how can we bridge the gap between the data that we can collect and the data that we will really find useful, and justify that our bridge can lead to convincing conclusions about our cities. In such cases, we need visualization tools to help display our models.

• OpenLayers — OpenLayers is a set of JavaScript to bring tiles of a map from map server.
• MapServer — MapServer is read in the geographical features such as roads, rivers, the boundaries of countries or administrative areas at the server side and then serve those tiles.
• Google Analytics — Google Analytics can help generate all types of dynamic charts about many different things.
• Google Maps API — After we get simulation and visualization tools, the next step is to find useful data. Below are some resources of data that can help us understand the cities.

Census Data

• Demographic & Health Surveys: The Demographics & Health Surveys contains data set created by people from the US about many different countries, it includes information such as the latitudes and longitudes of the different villages of the survey is and surveys such as the annual income, the average education, the birthrates and the death rate and other things.
• US Cenesus Bureau: US Cenesus Bureau contains comprehensive US census data, includes road segments, household statistics, travel patterns and so on.
• Senegal: National Agency of Statistics and Demography (ANSD) ANSD is data for development collaboration, contains it contains additional information about epidemics, employment and many other things.
• Open government initiative

Street Map & Yellow Pages

• Open Street Map: Open Street Map data is especially good in the US and in Europe, but it is also good at many other parts of the world.
• Yellow Pages

Call Detail Records

Data for Development of Ivory Coast and Senegal contains flow of population and the interaction of population in different countries. It can also used to track the spreading of epidemic diseases at country site, or make predictions about the economics.

Vehicle Tracking

• Niagara Frontiers Transportation Authority Developer Tools
• MTA Bus Time: There are many city governments developed some web API for people to extract the locations of the individual vehicles of different public transportation fleet, above are the web API of transportation at Buffalo and New York City.
• NYCT&L Taxi Data: NYCT&L Taxi Data is taxi data with information of latitudes and longitudes of all yellow cabs at Manhattan for over three years at a temporal resolution of one sample per minute. It can also help us understand whether there is a traffic, accident, or other things happening in the city.
• Cabspotting — Cabspotting is a data set that tracks the vehicles at San Francisco.
• mPat sample data
• Mobile Millennium

Mobile Millennium is a data set about the dynamics in the San Francisco Bay Area, including San Francisco, San Jose, Berkeley and many other cities.

In a paper “A survey of results on mobile phone datasets analysis” from Vincent D. Blondel, Adeline Decuyper and Gautier Krings, some advances made recently in the study of mobile phone datasets are reviewed.

Models of the behavior captured by mobile phone data sets

• Diffusion of information and diseases
• Trajectory as Levy flight, exploration vs exploration, very few habitats, predictability of next location, amount of commuting between two counties, neighborhood signatures revealed by calling patterns, movement and geographic information propagation in response to emergency
• Probability of making a phone call, duration of a phone vs distance between the two ends of this phone call, social/ethnic/historical partitioning revealed by phone calls, economic development vs diversity
• Duration of being in phone-contact, burstiness of inter phone call intervals
• Power law degree distribution, clustering/small world (maximizing utility), symmetry, community detection/ dynamics of small and large communities, inferring gender and age from call pattern

Applications of mobile phone data sets

• Urban sensing (o/d matrix, flow of people, classifying trips)
• Tracking diseases
• Viral marketing

In the Transportation Modeling, we want to identify a transportation simulator as event driven model and assign probability to different several paths of this event driven model. Based on this type of ideas if we identify the observation as the observation of the system, then we can think of simulator as hidden Markov process (although this kind of hidden Markov process is pretty complicated) and make inference about what’s happening about this hidden Markov process in terms of the observations.

In order to do the simulation and inference, several things are needed.

• Network file: information about roads The road network makes it a really a network. The nodes represent the road intersections, the edges are either the locations (such as buildings) or the road segments. There are road segments from one intersection to another intersection, and at these intersection a vehicle needs to make a decision whether it should go through one branch or another.
• Plans/population file: trips of all individuals between home, work, school etc. by car, walk, bus etc. Things actually determines the dynamic is the plans of people, the individuals. Such as their locations, places they go to on a typical day, the characteristics of the trips, and the times they plan those trips. The plans of the populations determine the dynamics because from the plans and the populations we can find out the state transition matrix
• Facility file: open time, close time and capacity of buildings
• Configuration file: choice to simulate traffic on a road, replan trips, score trips, generate output, etc.

The paper “A large-scale agent-based traffic microsimulation based on queue model” from Nurhan Cetin, Adrian Burri and Kai Nagela provide us a specific case study in terms of how the events are scheduled and how the events are registered against the timer and how we up to date the state of the road segments as well as the states of the agents in response to the different individual events.

• Each road has length, free speed and capacity.
• If number of cars ahead is less than capacity, a car can move to the next segment with free speed.
• Otherwise, the car wait (at a queue) until all cars move out.

Another paper “Generating complete all-day activity plans with genetic algorithms” from David Charypar and Kai Nagel provide us a dynamic model in terms of different people choose different trips every day.

• Change trip starting time
• Change route
• Change mode (car, bus, walk)
• Change destination

In this model, people change routes, change the travel mode and change the destination, and the road segments responds to the same demands in different ways at different times according to weather and many other things.

In event driven model we define set of events and events will change the state of the system in different ways. If a sample path of events with corresponding time is given we can find the probabilities.

$p\circ l_1\to p\circ l_2$
This represents one event where person p goes from one location $l_1$ to another location $l_2$.
The state of the system is defined in terms of the location of a person at time t and the number of vehicles on road segment A at time t.
$X^{(p)}_t$ is location of person $p$ in time t
$X^{(l)}_t$ is number of vehicles on road $l$ at time t

If road segment has a lot of vehicles then this vehicle will stay on the road segment A for longer time. If a road segment has a lot of vehicles then the driver might decide to go off to one branch instead of another branch. So the state transition matrix is actually a function of the state of this system. Anyway, we can identify this system in terms of just one event and then we can get the states of the event and from the states we can proceed to get the probability measure of different sample paths.

One way we can calculate probabilities is by using mean field approximation, a projection based method. Based on the model that we defined we will be able to find out the probability for us to make a transition from location $l_1$ to another location $l_2$, conditioned on our observations.

The posterior probability of event $pl_1\to pl_2$ and latent states $x_{t-1,t}^{(l)}$ , $x_{t-1,t}^{(p)}$ conditioned on observations is

\begin{align*}
& P(pl_{1}\to pl_{2},\{x_{t-1,t}^{(l)}:l\},\{x_{t-1,t}^{(p)},p\}|\mbox{obs})\\
& =\prod_{l}\alpha_{t-1}^{(l)}(x_{t-1}^{(l)})\cdot\prod_{p’}\alpha_{t-1}^{(p’)}(x_{t-1}^{(p’)})\\
& \cdot p_{l_{1},l_{2}}\cdot\delta(x_{t-1}^{(p)}=l_{1})\\
& \cdot\beta_{t}^{(l_{1})}(x_{t}^{(l_{1})})\delta(x_{t}^{(l_{1})}=x_{t-1}^{(l_{1})}-1)P(\mbox{obs}_{t}^{(l_{1})}|x_{t}^{(l_{1})})\\
& \cdot\beta_{t}^{(l_{2})}(x_{t}^{(l_{2})})\delta(x_{t}^{(l_{2})}=x_{t-1}^{(l_{2})}+1)P(\mbox{obs}_{t}^{(l_{2})}|x_{t}^{(l_{2})})\\
& \cdot\prod_{l\ne l_{1},l_{2}}\beta_{t}^{(l)}(x_{t}^{(l)})\delta(x_{t}^{(l)}=x_{t-1}^{(l)})P(\mbox{obs}_{t}^{(l)}|x_{t}^{(l)})\\
& \cdot\beta_{t}^{(p)}(x_{t}^{(p)})\delta(x_{t}^{(p)}=l_{2})P(\mbox{obs}_{t}^{(p)}|x_{t}^{(p)})\prod_{p’\ne p}\beta_{t-1}^{(p’)}(x_{t}^{(p’)})\delta(x_{t}^{(p’)}=x_{t-1}^{(p’)})P(\mbox{obs}_{t}^{(p’)}|x_{t}^{(p’)}).
\end{align*}

We can also estimate the probability that we do not have any event in a $\epsilon$ time (very small fraction of time)
The posterior probability of no event and latent states $x_{t-1,t}^{(l)}$ , $x_{t-1,t}^{(p)}$ conditioned on observations is
\begin{align*}
& P(\emptyset,\{x_{t-1,t}^{(l)}:l\},\{x_{t-1,t}^{(p)},p\}|\mbox{obs})\\
& =\prod_{l}\alpha_{t-1}^{(l)}(x_{t-1}^{(l)})\cdot\prod_{p’}\alpha_{t-1}^{(p’)}(x_{t-1}^{(p’)})\\
& \cdot\left(1-\sum_{p’:\mbox{person}}\sum_{\mbox{event }p’l_{1}\to p’l_{2}}p_{l_{1},l_{2}}\cdot\delta(x_{t-1}^{(p’)}=l_{1})\right)\\
& \cdot\prod_{l}\beta_{t}^{(l)}(x_{t}^{(l)})\delta(x_{t}^{(l)}=x_{t-1}^{(l)})P(\mbox{obs}_{t}^{(l)}|x_{t}^{(l)})\cdot\prod_{p’}\beta_{t-1}^{(p’)}(x_{t}^{(p’)})\delta(x_{t}^{(p’)}=x_{t-1}^{(p’)})P(\mbox{obs}_{t}^{(p’)}|x_{t}^{(p’)}).
\end{align*}

The state space is Combinatorial state space and it is big. For example, if we want to model San Francisco we normally have about sixty thousand road segments and if we want the simulation about a city to be realistic we need to have about three hundred thousand vehicles which is approximately 1% of the vehicles running in San Francisco for a whole day. The state space is going to be very big so we have to either use Markov Chain Monte Carlo or Variational method.

There are two ways to make inference using variational method. One is through mean field approximation meaning that the vehicles move around and the road segments change state according to mean field effects. This means that we will project all other things, all other states into just this one individual or one road segment. The problem becomes more tractable.

# Opinion Dynamics Case study: Tracking Contact-Induced disease at individual level with mobile phones [10/22]

Prepared by Ankit Sharma, Karthik Kiran and Mohit Arora

The idea is to combine our theories and models with big data sets to obtain inference about them and perform interesting analysis.

Cell phones have become an important platform for the understanding of social dynamics and influence, because of their pervasiveness, sensing capabilities, and computational power. Many applications have emerged in recent years in mobile health, mobile banking, location based services, media democracy, and social movements. With these new capabilities, we can potentially be able to identify exact points and times of infection for diseases, determine who most influences us to gain weight or become healthier, know exactly how information flows among employees and productivity emerges in our work spaces, and understand how rumors spread.

There are many data sets that track communities which are collected from mobile phones. There are many applications to track user behavior and generate very large data sets. The privacy of the user has to be taken care of according to the standards.

These data sets are inter related with one another forming a network of data sets.

Some of the Open source datasets are:

• Nokia Mobile Dataset – Sensors embedded in mobile phones and other wireless devices are used to collect large quantities of continuous data pertaining to the behavior of individuals and social networks.
• Reality Commons – Contains the dynamic of several communities of about 100 people each which were collected in the MIT Human Dynamics Lab.
• MERL (Mitsubishi Electronic Research Lab)Motion detector data set – Data set collected with printed circuit boards worn by the researchers of the MERL for nearly a year which gives the traces in the life of a research laboratory. There were also anchor nodes installed at different corners of a floor. From those kinds of things we can use localization algorithm to find out the exact locations of the workers, for the workers to finish the task, so where have they been, with which other persons they have been interacting and observations of similar kinds.
• Dartmouth Student Life Dataset – The whole Student Life dataset is in one big file. This contains all the sensor data, EMA (ecological momentary assessments) data, survey responses and educational data.
• Telecom Italia Big Data Challenge – The dataset provides information about the telecommunication activity over different places.
• CRAWDAD – Crawdad is a community resource for archiving wireless data at Dartmouth containing many such data sets about networks. Those sensor networks may not necessarily be collected with mobile phones, could be collected another such as embedded devices or printer circuit boards. The data set includes data reflecting interaction among students, students in dormitories, workers etc.

In the data set where mobile phones collect data based on blue tooth proximity, mobile phones can scan another phone within the range which is an indication of proximity. Information about nearby access points within range of the mobile phones are also retrieved in the process. By knowing the location of wifi access points, the location of people can be determined. By knowing the building in which the wifi access point is present, the activity of students can be determined.

Above is a snapshot of blue tooth proximity in the data set. People are grouped into clusters and there are multiple clusters. This is a snapshot of proximity in the data among a group of 100 people who know one another.

The second plot shows the locations of students in one week. The y axis is indexed by hour of a week. The x axis is indexed by different locations representing different wifi access points.

It can be seen that the thick yellow line represents the time when the students are in the dormitory and the other yellow lines represent the classrooms. This gives an overall picture of behavior of students and their interaction.

From the above plots, it can be seen that there is evidence about diffusion in the network. Relationship between activity, proximity and friendship can be obtained.

For each person, distribution of the different access points is plot. There are multiple distributions obtained when data is obtained for many people and the correlation of this distribution gives the behavior and other characteristics of people.

In the third plot, ordering of correlation of non friends and correlation of friends are plotted against each other. What we see here is that when the correlation of friends becomes about 40, the correlation of non-friends is still about 20%. This means that friends have higher correlation about their visit of location. This means that if we know that two persons are friends then we could be able to predict the behavior of one person from the behavior of another person. This means that from the sensor data we can actually be able to predict the behavior of people through their friendship.

In the second plot, the aerobic activity of person A and person B are plotted. From the plot it can be seen that, if A and B are friends then number of aerobic exercises per week of A can be determined by number of aerobic exercises per week of B as they as correlated. So from this dataset, behaviors of friend pairs like physical activity can also be determined. Friends converge on type of behavior.

Another piece of information we have is result from cold and flu surveys. Daily reports are take n from students about cold and flu. First plot is report of fever and second is for running nose (common cold).

Red mark indicates that the student has caught fever/cold on the particular day represented by X-axis.

The count of students catching fever is much lesser than students catching cold. Black marks indicate not reported data which the students might have not been able to report on some days.

What we can do from this data is to how can we track this gradient of cold and flu at the individual level from those symptoms report and from the proximity network. The report generated is for a smaller population and task is to generate models which can infer from this data and apply the results to a larger population.

In order to do that infectious susceptible model can be used.

Agent-based Susceptible-Infectious-Susceptible model:

$$S+I\to 2I \mbox{ with rate } \beta\\ I\to S \mbox{ with rate } \gamma$$

This model has two events. One event is called infection and another event is called recovery. Here, S represents a susceptible individual and I represent an infectious individual. Following are the softwares that can be used to simulate an epidemic spreading dynamics:

1. Influenza Epidemics Simulation Model: This is actually a model to simulate the spreading of epidemic Influenza over the globe. The input data includes the number of passengers go from one airport to another airport globally, as well as the local movement on highways and on railroads.
2. Spatial Temporal Epidemiological Modeler (STEM): It is developed in Java. It also has the capability to simulate the dynamics of epidemics globally but most of the examples are at the city level.

SIS dynamics:

Input:

• $G=(N,E) \mbox{: dynamic network} \\ N=\{1,\dots,\mbox{number of people}\} \mbox{: people} \\ E=\{(n_1,n_2,t):n_1\mbox{ is near }n_2\mbox{ at time }t\} \mbox{ : “nearby” relation}$
• $\alpha\sim\mbox{beta}(a,b) \mbox{: probability of infection from outside,} \\ \beta\sim\mbox{beta}(a’,b’) \mbox{: probability of infection from within the network,} \\ \gamma\sim\mbox(a”,b”) \mbox{: probability of recovery.}$

After we have defined the probability for each sample path, we will be able to make inferences.

• $P(Y|X):$ how symptoms are dependent on common cold and flu state.

The output of this is in terms of a matrix, X and Y. X is the latent state. X can take two states, which is either zero, meaning a person is susceptible, and one meaning a person is infected or infectious.

Output:

• A matrix structure $\{X_{n,t},Y_{n,t}:n,t\}$ indexed by time t and node n. Latent state $X_{n,t}$ of node n at time t is either 0 (susceptible) or 1 (infected). Symptom $Y_{n,t}$ is probabilistically dependent on the state $X_{n,t}$.
• A collection of events R drives change in state.

We also have symptoms, which is a vector. Include whether a person has fever, has runny nose, has diarrhea, is stressed, is sad or something like that.

Now we want to understand how this infection spreads. This simulation model works in the following way.

• Sample parameters according to beta distributions. $\alpha\sim\mbox{beta}(a,b) , \beta\sim\mbox{beta}(a’,b’) and \gamma\sim\mbox(a”,b”)$
• $X_{n,1}=0,\forall n\in N$ : all people are susceptible at t=1

After we fix those three parameters, we work with a Markov process:

This means that we first sample the state of all people at time 1. After we have the state of people at the beginning of our simulation, we iterate over time. At each time we will randomly sample the probability for infectious people to be recovered with the parameter gamma. If we have an edge that connects an infectious person and a susceptible person, we will sample the probability of this infectious person to infect this susceptible person. Also, for any susceptible person in this system, we will sample the probability for another person outside of the system to infect this susceptible person in the system.

Probability Measure Defined by Agent-based SIS Model:

$\begin{eqnarray} && P\left({X_{n,t},Y_{n,t}:n,t},\alpha,\beta,\gamma\right)\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}P(X_{n,t+1}|\{X_{n’,t}:n’\},\alpha,\beta,\gamma)\prod_{n,t}P(Y_{n,t}|X_{n,t})\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}\gamma^{1_{X_{n,t}=1\wedge X_{n,t+1}=0}}\cdot (1-\gamma)^{1_{X_{n,t}=1\wedge X_{n,t+1}=1}}\cdot \\ &&\left[\alpha+\beta\cdot\sum_{(n’,n,t)\in E}X_{n’,t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=1}}\cdot \left[1-\alpha-\beta\cdot\sum_{(n’,n,t)\in E}X_{n’,t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=0}}\cdot\\ &&\prod_{n,t}P(Y_{n,t}|X_{n,t})\\ \end{eqnarray}$

We can observe the four different types of events here:
1. If a person at time T is infectious and time T+1 is susceptible. This is our recovery event (gamma).
2. The probability for an infectious person to be still infectious, is consequently one minus gamma.
3. If a person at time T is zero and at time T+1 is one, this is an event of infection.
4. The total probability of infection from another person within the system and not infected, which means that we have state zero at time T and state zero at time T+1 is one minus this number.

Gibbs-Sampling of Agent-based SIS Model:

$\begin{eqnarray} X_{n,t+1}|\{X,Y\}\backslash X_{n,t+1};\alpha,\beta,\gamma &\sim& \mbox{Bernoulli}\left(\frac{P(X_{n,t+1}=1)}{\sum_{x=0,1}P(X_{n,t+1}=x)}\right)\\ &&\mbox{where }P(X_{n,t+1}=1)=P(X|\alpha,\beta,\gamma)P(Y|X)\\ R_{n,t}&\sim&\mbox{Categorical}\left(\frac{\alpha,\beta,\dots,\beta}{\alpha+\beta\sum\limits _{n’}1_{(n’,n)\in E_{t}\cap X_{n’,t}=1}}\right)\\ \alpha &\sim& \mbox{Beta}(a+\sum_{n,t}1_{\{R_{n,t}=1\}},b+\sum_{n,t:X_{n,t}=0}1-\sum_{n,t}1_{\{R_{n,t}=1\}})\\ \beta &\sim& \mbox{Beta}(a’+\sum_{n,t}1_{\{R_{n,t}>1\}},b’+\sum_{n,t:X_{n,t}=0;n’}1_{(n’,n)\in E_{t}\cap X_{n’,t}=1}-\sum_{n,t}1_{\{R_{n,t}>1\}})\\ \gamma &\sim& \mbox{Beta}(a”+\sum_{n,t:X_{n,t}=1}1_{\{X_{n,t+1}=0\}},b”+\sum_{n,t:X_{n,t}=1}1-\sum_{n,t:X_{n,t}=1}1_{\{X_{n,t+1}=0\}}). \end{eqnarray}$

The Gibbs sampling works if we know the probability distribution of one variable condition and condition of other variables. The variables in our system include alpha, beta, gamma, which are the rates of events. We also have x which is the state of the individuals. What we do is we iterate over all persons and all times and we sample those x conditions on everything else. First we sample alpha, then we sample beta, and then gamma. Then we come back to sample all of those x. Theoretically, after we have iterated this sampling process for long enough time, we will be able to get the distribution of not only the rates of the system, but also the states of the individuals and the joint states.

Variational Inference with Agent-based SIS Model

The challenge here is that we are coping with a system with a huge number of states. In order to make the problem tractable we need to use some variational influence method. The variational influence method that we have used is called expectation propagation.

$\begin{eqnarray} \alpha(X_{n,t+1}=1) &=& \alpha(X_{n,t}=0)\cdot (a+b\sum_{(n’,n,t)\in E}\gamma(X_{n’,t}))\cdot P(\mbox{obs}|X_{n,t+1}=1)\\ &&+\alpha(X_{n,t}=1)\cdot(1-c)\cdot P(\mbox{obs}|X_{n,t+1}=1)\\ \\ \alpha(X_{n,t+1}=0) &=& \alpha(X_{n,t}=0)\cdot (1-a-b\sum_{(n’,n,t)\in E}\gamma(X_{n’,t}))\cdot P(\mbox{obs}|X_{n,t+1}==0)\\ &&+\alpha(X_{n,t}=1)\cdot c\cdot P(\mbox{obs}|X_{n,t+1}=0) \end{eqnarray}$

The probability alpha, such that the probability for a person n at time T+1 to be infectious which is 1 equals the probability for this person at the previous time step to be susceptible times the probability for this person to be infected between time T and T+1. Another probability, that this person is infectious at time T and this person hasn’t been recovered between time T and time T+1 times the probability observation.

Similarly we can iteratively estimate the probability for a person to be susceptible at time T+1 from the probability of the four statistics, for this person to be infectious or susceptible at time T.

Validation of Agent-based SIS Model to Fit Data:

The next thing that we want to do is to validate, to make the case that this model is applicable to the data and the dynamics as we hypothesized. The way to make the case that the model is applicable to the data is through hypothesis testing. Here are the three observations:

First one shows that the infection is related to the network. This means that if I have a lot of friends, who contacts infected today then I will have a higher probability of being infected tomorrow. This is done through a permutation test, i.e.,permute the symptoms of the individuals within a system. This makes the network independent.

The next plot is of people reported certain symptoms and the number of friends with symptoms with its log probability.

Last plot has the probability distribution of recovery versus the theoretical probability, assuming exponential distribution. This means that this hypothesis also works. If it doesn’t work, the probability for us to see those kinds of plots will be very low.

Performance of Agent-based SIS Model

Now we will predict symptoms from dynamic network and symptoms from “probes” by comparing our estimation with “ground truth”.

The black line here is the ground truth and the red line here is the result of the algorithm. The blue line here is the result of scaling. It turns out that scaling does not work very well. This is saying that by combining the model with the data set we can actually be able to estimate the infectious population quite well.

Application to real world:

In summary, in this lecture we discussed a specific example of how we can combine a specific model with real world data set and to make predictions about it. Once we are able to do that, then we can tweak the network and  be able to direct the infection or information flow in the ways that we want to maximize the utility of the network. Similarly we can do many other things, for example, to combine the simulation models about transportation with transporting data set and to apply the behavior data, how people move around in a city and how people interact with one another and combine this type of data with the systems dynamic model about the economic development of a city and talk about how we can tune the behavior and interaction of people to increase the productivity of the city.

# Exponential Random Graph Model [10/13]

Le Yang, Harshith Kumar Ramadev and Karthik Kakarla Nagendra Prakash. October 13, 2015

### Introduction to Dynamic Networks

With these dynamic networks, what we can do?

1. Summarizing: What are the characteristics of a dynamic network data? For example, the triangle closure property says if A and B are friends and B and C are friends, then A and C are friends. This property can be summarized from different dataset including some dataset mentioned above.
2. Modeling: how to modeling the dynamic networks? There are two ways: a) from the perspective of economists and social scientists; b) from the perspective of physicists. Both perspectives will be discussed later in details.
3. Prediction: One important objective of research on dynamic network is to predict what will happen in the future. For example, in political relationship dataset, we want to know if one nation will support another nation or fight with it.

Network formation with game theory node/agent based modeling (economists and social scientists)

In this perspective, each node in network represents the individual people. Each people has utility function which they want to maximize. For example, people want to maximize their financial benefit or productivity. More importantly, the utility function is related with not only the resources of the individual but also the properties of the other edge through the edges. For example, people may cost money in forming or maintaining a link while gains profit in having connections to others.

One question is how can the individual agents build edges and get rid of edges to maximize the utilities of themselves. Once we have utility function for each individual, we can think of what network structure will emerge as a equilibrium of this game and what dynamic do we have in network evolution. From the overall perspective, another question is whether or not we have a way to design overall policy over the dynamic network so that the overall utility function can be maximized.

Network formation with tie based modeling (physicists)

First model is random graph model. In this model, a network have parameters N and P which means there are $n \choose 2$ potential edges and each of them might appear with probability P. Obviously, the average number of edges is $N \times P$. Some questions we may ask about this random graph is as follows:

• What is the average path length from one node to another node?
• What is the average distance?
• Whether there will be a lot of small clusters or only few large clusters?

Second model is so-called small world model. First we construct a rigor model where every node is connected with its full neighbors. Then we randomly rewire those edges from one node to any other randomly sampled node in the world. One interesting feature of this kind of random graph is the shortest path from any node to any other node is going to be the logarithm of the size of the world, i.e. the number of nodes in this network.

The third model is preferential attachment model constructed by Barabasi to talk about the fat tail distribution of the edges. The fat tail distribution means some people have a lot of connection and most of the people have very few connection. His idea is to start from some kernel and then grow the network by adding in new nodes. At each time a new node is added, it will establish connection with existing nodes according to the number of degrees. Obviously, rich get richer.

### Markov networks exhibiting Maximum Entropy property

The underlying idea in this part of the discussion is to form a dynamic network, which primarily means constructing a network by adding new edges or removing existing edges. Here we only consider the current state of the system and details about the previous system states are not taken into consideration. This process of starting from an initial state and transforming the network by adding/deleting edges to reach a state of equilibrium/asymptotic distribution demonstrates the principle of Markov networks exhibiting Maximum Entropy property.

When the features/sufficient statistics are known and also the average value of some functions of random variables are given, the probability distribution that best depicts the current state of the system would be having the largest entropy. This statement again is based on Markov property. The corresponding equation that shows this relation is given below:

Maximize $H(X) = – \int f(x)logf(x)d(x)$,

Subject to: $E(f_i(X)) = a_i$ for $i =1,\dots,n$.

Given below are the set of well-known family of probability distributions and their properties:

 Distribution Parameters θ Natural parameters η Base distribution h(x) Sufficient statistic T(x) Log-partition A(θ) Bernoulli p log (p/(1−p)) 1 x -log(1-p) Binomial p log (p/(1−p)) nCx x -nlog(1-p) Poisson λ log λ 1/x! x λ Exponential λ log λ 1 x -log(λ)

Here, we can brief about the Partition function i.e., $A(\theta)$. This function describes the statistical properties of the system in equilibrium state or asymptotic state. Also, we can understand the importance of base distribution i.e., $h(x)$ which gains significance when we are approximating a probability distribution with any other probability distribution of exponential family.

The equation that describes the relation of main statistics, sufficient statistics, Base function and Partition function on a given probability distribution is shown below:

$f_X(x;\theta)=h(x)exp(\eta(\theta)⋅T(x)−A(\theta))$

The directive graph with $n$ nodes and $n \choose 2$ edges which is used to analyze the data about social networks and other real world systems constitute the Exponential Random Graph Model (ERG’s). The edges are tied with average values of features/statistics.

The underlying question in the concept of ERG’s is, “What is the probability distribution of different graphs according to exponential distribution defined by exponential graph model?”.

The answer for this question can be represented by an equation as follows:

$P(X=x|\theta)=P_{\theta}(x)=c \times exp(\theta_1 z_1(x)+….+\theta_p z_p(x))$

The maximum entropy distribution of $p$ features/statistics/invariants is represented by $E(Z_1(x))….. E(Z_p(x))$.

Note that the variable $c$ in the above equation represents the “Exponential of partition function”.

The Sufficient statistic for a family of probability distributions is defined as the statistic defined for a probability distribution which gives all the information as given by a sample from which this statistic was calculated.

Below are some of the examples of scenarios and their sufficient statistics:

One tied scenarios:

1. a) Reciprocity – If A and B are friends, then B and A are also friends.
$M(x)=\sum_{i<j} x_{ij}*x_{ji}$
2. Birds of same feather flock together.
3. Bernoulli – Measured by density of edges. $\sum_{ij} x_{ij}/ \sum_{ij} 1$

Multiple tied scenarios:

1. Triangles – If there is an edge between points i and j, j and k; then there is also an edge between points i and k. The sufficient statistic for the number of triangles is illustrated as follows: $T(x)=\sum_{i<j<k} x_{ij}*x_{jk}*x_{ki}$.
2. Number of k-stars: the sufficient statistic for k-star scenario is given by:$S_k (x)=\sum_{i \in N} {x_{i+} \choose k}$, where $x_{i+} = \sum_{j} x_{ij}$.

In general, given a sufficient statistic and probability distribution, result obtained is a exponential random graph (Static Version). i.e., we will proceed to sample different random graphs according to the distribution defined by the features.

Dynamic ERG’s involve dynamic changes to the graph based only on the current state of the graph to attain the equilibrium state. Hence, we can track the dynamics of evolution of network based on limited observations about the network.

We can use the sampling algorithm to define the probability distribution of any sample path in terms of set of events. The sampling algorithm is as follows:

An ERG is represented by a $M \times M$ matrix, where a value of 0 in a cell shows that there is no edge between two vertices and a value of 1 shows that there is an edge between two vertices.

$X(t_m)$ to $X(t_{m+1})$ for $m=0,\dots,M$.

1. Sample next event time according to exponential distribution and update time t←t+Δt, where Δt ~ Exponential(ρ).

X is a function of time and the way to sample this graph is to first sample the next event according to the exponential distribution (maximum entropy property). The time of the event is calculated based on time at which first success event is observed in any experiment.

1. Sample directed pair $(i,j)$ uniformly and update $X_{ij}$ according to Gibbs Sampling.

Each pair can have two states – Edge or no edge. Using Gibbs sampling, we find the probability of having a edge on all (N-1) edges. This probability will be an exponential distribution. Taking the logarithm of this result will fetch us the linear form equation:

$P(X_{ij} = 1 | X \setminus X_{ij}) = \frac{P(X =\Delta^{+}_{ij}x)}{P(X =\Delta^{+}_{ij}x) + P(X =\Delta^{-}_{ij}x)}$

$log \frac{P(X_{ij} = 1 | X \setminus X_{ij})}{P(X_{ij} = 0| X \setminus X_{ij})} = \sum_k \theta_k (z_k(\Delta^{+}_{ij}x) -z_k(\Delta^{-}_{ij}x))$

where $\delta^{+}_{ij}z_k(x) = z_k(\Delta^{+}_{ij}x) -z_k(\Delta^{-}_{ij}x$ is called change statistic and  ‘+’ indicates there exists an edge and ‘-‘ indicates no edge between two vertices.

The equilibrium can be achieved using the Gibbs sampling approach of sampling path estimation.

Also, given the aggregate statistics such as number of people taking different courses and the time they spend in offices every day, we can perform the parameter estimation to each of these sufficient statistics, which will be explained below.

### Parameter estimation

Note that we already have a way to sample either static exponential random graph model or a dynamic exponential random graph model. So let us now discuss how we can estimate the parameters of those models. First let us try to sample a dynamic ERGM. This approach is called Geyer-Thompson approach which is based on Importance sampling.

1. Suppose that we have observations about the edges, we would like to sample the parameters in such a way that the observations are matched.
2. Sample $x(1),…,x(M)$ according to $θ~= (θ~1,…,θ~p)$. We sample this dynamic network model by first making $M$ samples of $X$ according to some initial parameters, where each $X$ is a matrix i.e. each X is an exponential random graph. Here the initial parameters are the ‘p’ features or ‘p’ sufficient statistics of $\theta$.
3. Once we have the initial sample, we iteratively improve the parameters according to the Newton’s method (Newton-Raphson method).
4. Now take another sample of x and repeat this process until z is maximized. After we have a good enough estimation of theta then we will repeat this process, normally we will need to take another sample because we have many local minima and maxima.

Details about the 3rd step: Newton’s method is a process to find the roots of functions whose graph crosses or just touches the x-axis. Consider a curve f(x) which is shown below. We would like to find the root of f(x) i.e. find the value of x for which f(x) will be equal to 0. We first assign a value to x say $x = x_0$ which is sufficiently close to the root. We then apply first order approximation of $f(x)$ at x = x We get a new value of x say $x_1$. Now we apply the first order approximation of f(x) at $x = x_1$, to find $x_2$. This procedure continues until you converge to a single value which is the root of the function. This is illustrated in the below example. Note that The Newton-Raphson method is just the multidimensional equivalent of this one dimensional Newton’s method.

Example:

1. We initially consider xo to be 6 and apply first order approximation of f(x) at x = 6. We will get the value of x1 as 3.33.
2. Now we apply first order approximation of f(x) at x = 3.33. We get x2 to be equal to 2.27.
3. Now we apply first order approximation to f(x) at x = 2.27. This will give us the value of x3 to be 2.01.
4. Now we apply first order approximation to f(x) at x = 2.01. This will give us the value of x4 to be 2.00. Now if you continue further you will get the same value of x which is 2.00. You can say that the value of x is converged to 2.00 which is the root of f(x).

Let us understand the intuition behind this method: One of the simplest functions one can deal with is a linear function: $f(x)=mx+b$. In particular, if you want the root of a linear function, it’s quite easily figured: $x=−bm$. Now, it is well-known that the tangent line of a function is the “best” linear approximation of a function in the vicinity of its point of tangency:

The first idea of the Newton-Raphson method is that, since it is easy to find the root of a linear function, we pretend that our complicated function is a line, and then find the root of a line, with the hope that the line’s crossing is an excellent approximation to the root we actually need.

Mathematically, if we have the tangent line of $f(x)$ at $x=a$, where ‘a’ is the “starting point”: $f(x) \approx f(a) + f′(a)(x−a)=0$. This is obtained by the Taylor series of the function $f(x)$ by considering only the first order terms. If we want equation in terms of $x$, then $x=a−\frac{f(a)}{f′(a)}$. Let’s call this $x_1$.

As you can see, the blue point corresponding to the approximation is a bit far off, which brings us to the second idea of Newton-Raphson: go for multiple iterations until you the reach the converged point.

As you can see, the new blue point is much nearer to the red point. Mathematically, this corresponds to finding the root of the new tangent line at $x = x_1$: $x_2 = x_1−\frac{f(x_1)}{f′(x_1)}$

This can be written in the form of a matrix. We can keep playing this game (with $x_2, x_3,\cdots, x_n$), up until the point that we find a value where the quantity $f(x_n)\over f′(x_n)$ is “tiny”. We then say that we have converged to an approximation of the root. That is the essence of Newton-Raphson.

The reason we first take a sample is to be able to estimate the sufficient statistics. This is given by $E_\theta(z(x)) \approx [\sum w^{(i)}* z(x^{(i)})]$ for $i = 1, 2, 3, \cdots,m$

This is how we estimate the sufficient statistics. Here $w(m)$: weight of $x_m$ and θ takes value of $\theta^{(g−1)}$ and

$$\begin{eqnarray*} w^{(m)} &=& \exp\left(\sum_{i=1}^p(\theta_i-\tilde\theta_i)z_i(x^{(m)})\right) \bigg / \sum_{m=1}^M \exp\left(\sum_{i=1}^p(\theta_i-\tilde\theta_i)z_i(x^{(m)})\right)\\ D(\theta) &=& \sum_m w^{(m)}z(x^{(m)})z(x^{(m)})^\mbox{T}-\left[\sum_m w^{(m)}z(x^{(m)})\right]\cdot\left[\sum_m w^{(m)}z(x^{(m)})\right]^\mbox{T}\\ \theta^{(g)} &=& \theta^{(g-1)}-D(\theta^{(g-1)})^{-1}\left[\sum_{m=1}^M w^{(m)}z(x^{(m)})-z(x_\mbox{obs})\right]\end{eqnarray*}$$

Here instead of taking the partial derivative of ‘f’ over ‘x’, what we have done is we have used the difference which approximates to the partial derivative. The equations above correspond to the difference between weighted values of the sufficient statistics which should be close to the estimated value of this one under this new sample. By these equations we find the next theta that is closer to the sufficient statistics that we are given. By iterating this we hope that in the end we will be able to get some theta that makes the expected value the empirical average to be close to this given average of the statistics.

Now let us see another approach for parameter estimation which is called Robins-Monro approach based on Metropolis-Hastings sampling. Given below are the steps followed in this approach.

1. We make a sample of X, and then for the sample, we compute the mean value of the features and the statistics of those features.
2. We iteratively sample the next exponential random graph and estimate the next parameters. We continue this process until it converges.
3. We will proceed with several sub-phases and in each sub-phase we use smaller and smaller steps so that we do not over shoot the actual solution.
4. Finally, we need to check the convergence which is estimated in terms of the deviation. Since standard deviations are pretty hard to estimate due to the lack of parameters, we sample a set of X according to our last estimation of θ, then from the sample we estimate standard deviation of different features and thereby the sufficient statistics.

### Goodness of Fit

After estimation of the parameters and estimation of our sample, it is important to check if they are a good fit to our model. In case of social networks, the properties you require your model to have are very dynamic in nature i.e. today you may require your model to fit certain properties and tomorrow you may require it to fit some other properties. Therefore whenever we have a model, we need to consider the goodness of fit. There are two ways to talk about goodness of fit.

1. One way is to talk about a common statistic. Suppose we have quantile-quantile plot for one distribution v/s another distribution and if these plots are within a narrow band, then we say it is a goodness of fit.
2. Another way to work is that, suppose the statistic has a Gaussian distribution, then we can perform chi-square tests and talk about the Gaussian distribution.

### Statnet package in R

As long as the network is static and not dynamic, all the methods we have discussed so far like sampling, parameter estimation and testing goodness of fit could be implemented using the statnet package in R.

Let us see how we can track or estimate dynamic networks. For example, let us consider the Florentine marriage data set which indicates the marriage relationships among different families. This data set has a set of attributes like wealth, population etc. which describe all the families resulting in different covariance for each of them. Edges are formed between two families if there is marriage relationship between them.

As discussed earlier, it is important for us to understand the marriage network and to sample this network. But the only thing we know are the edges that are formed. Considering the random graph model, the only thing we need to know is the number of edges so that we can sample it and fit the model. After we fit the model, we compute the p-value to check whether our model is statistically significant enough.

One of the ways to improve this model is through triangle inequality. Accordingly, if family A form marriage with family B and family B forms marriage with family C, then according to the principle of triangle inequality, family A should most likely form marriage with family C. But if you see the graph of the network, it is obvious that this principle does not hold true here as there is no triangle closure in case of some of the nodes. So we can proceed with model fitting. The below graph with the corresponding R code show the plot of the network in which the size of the node is proportional the wealth of the corresponding families. We find that the families with similar wealth are more likely to be form edges between them.

<pre>plot(flomarriage, vertex.cex=flomarriage %v% ‘wealth’ / 25, main=’Florentine marriage by wealth’)</pre>

We can include this factor in our model so that now model will be a function of both edges and the covariance of the wealth of the families. If you compare the p-value of the previous model (considering the triangle closure principle) with the p-value of this model, we find that p-value has improved from 0.79 to 0.027. We can infer that although this model is not statistically very significant but still there is some truth behind the principle – “Birds of same feather flock together”.

We can also sample the dynamic network by using the Gibbs sampling algorithm where we sample one edge and fix all other edges. We can simulate this network using a model that involves the density of edges as well as wealth and birds of the feather coefficient into consideration. These graphs are only sample graphs and will be different from the original network. We can see that the edges in the simulated graphs do not coincide with the edges in the original graph. The below graphs correspond to the 10 simulations of Florentine marriage network which are computed using the R code below.

flomodel.wealth.sim <- simulate(flomodel.wealth, nsim=10)
for(I in 1: length(flomodel.wealth.sim)) plot(flomodel.wealth.sim[[i]], vertex.cex=flomarriage %v% 'wealth' / 25, main='Florentine marriage simulaion')

In order to talk about the goodness of fit we can think of different statistics and then we can compare the different statistics in the target network with the different statistics in the original network and we see whether those are correct. The goodness of fit computation in R and the corresponding results are given below.

flomodel.wealth.gof <- gof(flomodel.wealth~degree + esp + distance)
plot(flomodel.wealth.gof)

Here we are fitting the family wealth with the degree of the edges and plus some other things. Note that we have many families. Thus we will have a distribution of the wealth of these families based on our graph. We can compare the distributions with the observed distributions. From these results, we can infer that this model is pretty good in simulating the dynamic network we have.

# Temporal Networks [10/15]

##### Maira Saboia, Venkatesh Yerneni and Vinayak Karuppasamy -October 15, 2015

Social Evolution Network (matrix)

This lecture will be about examples of Temporal Networks. The first network mentioned here uses social evolution data set. The matrix depicted in the Figure 1 was drawn using this data set, which is about to a group of undergraduate students in a small student dormitory at the MIT Campus. Surveys were conducted at different times, almost at the end of every month. The data selected include locations (WiFi access points) and proximity data. Each matrix in the Figure 1 represents a survey on a certain relationship at a certain time. Each column is indexed by somebody who filed out the survey, and the rows correspond to the people asked about.

From the aforementioned matrix we can see that close friends is a subjective decision; socializing with means spending at least two hours per week and so forth. And since that time was the time of presidential election, we also asked about whether people discussed politics. This is a very typical kind of social network. If we look at those matrix it is possible to see clusters. If we look over time, we find out that, for example, some clusters began pretty small and they grow in size and so on and so forth.

Social Evolution Network (Graph) : Friendship Network (Left – 2008.12, Right – 2009.03)

Figure 2 is a second view of this same network. Those are still the undergraduate students from freshman to senior, and there are also several graduate tutors. In this student dormitory there are four floors, and each floor is separated by a firewall in the middle into two parts, and there is passage in the firewall of the third floor. People on both sides of the firewall could communicate with people on other side.

Students organized in terms of the living sectors. The living sectors are represented with different colors. We can see that there are a lot of interactions between the green people and the yellow people. We also have numbers, which represent different schools years. We can see, for example, the freshman go together three months after the semester began. If we look at the network again, after another three months, although the freshman are still there they get mingled into other group of people in this undergraduate dormitory. There are also other changes.

Suppose that we can ideally ask a survey about the relationships every day. Suppose we have this ground truth, then the question that we can ask are: what is the mechanism for people to form friends and for people to dissolve their relationship? Whether or not we can interpolate the snapshots of the dynamic networks using an event trigger model. And whether or not the interpolation of the networks in between the observations grew with what, supposing that we have a ground truth.

Another question is, if we think about it in terms of KL divergence, how well are our estimations of the snapshots of the dynamic networks between our service? We can also ask what are the different graph rates or different weight that we will use? and what are the sufficient statistics that corresponds to the formation of this kind of network?

Product Space Network CA Hidalgo, B Klinger, A-L Barabasi, R Hausmann. “The Product Space Conditions the Development of Nations” Science (2007) 317: 482-487

Another network, which is also very interesting, is the Product Space Network. The origin of this network is the discussion about the rich people club. The idea is that imagine that we have a forest, and in this forest we have many trees, there are high branches and low branches in the trees, and we have monkeys. The idea is that there are monkeys on the high branches and there are monkeys at the low branches. But normally it is pretty hard for monkeys from low branches to climb up to the high branches. High branch monkeys do not want to go down to the low branches either.

In terms of the GDP (Gross Domestic Product) of different nations, and the richness of different nations, we have the same kind of problem. The idea is why some nations are richer than some other nations. This is because the monkeys of the high branches need to give a hand to the monkeys at the low branches to climb up.

We can observe that we have different industrial sectors and the different industrial sectors are related with one another in the sense that, for example, if I manufacture chips then I will have better access with to software engineering. For example, it takes longer for people to grow vegetables to be able to write code. The reason is that there is big gap between the people who grow vegetables to the people who write code.

It is possible to correlate different industrial sectors and then if a nation has import or export of both type of industry, then the distance is closer. If a country either import or export one kind of thing, or another kind of thing, but never both, then distance between those two is longer.

The red lines represent lines that are pretty close, and this not according to the geometric distance. The red lines are pretty close, and the blue lines are a little further and the yellow lines are even further.

For the countries who have already mastered the production of some products, then this country could very easily navigate from those products to the nearby products. But it will take longer time to navigate from those products that are further away.

The Coauthorship network represents a collaboration network. Each node in this graph represents a person, and the nodes are colored according to the person field of study. The chances of collaboration are represented by the distance between the nodes. On observing this graph, we see that people in the same area with similar colors, collaborate with each other. This can be seen in the graph that all nodes with the same color are clustered together, which makes complete sense that people working in the same field tend to collaborate on research ideas.

Coauthorship Network, M. E. J. Newman. “Coauthorship networks and patterns of scientific collaboration”. PNAS 2004 101

However from a new observation of this graph can be seen that there are chances for collaboration between nodes of different colours depending on their distance from each other. For example, we can see from the graph that nodes in statistical physics are closer to nodes in ecology, and ecology nodes are also close to nodes in nodes in agent based modeling. This can be true as people in ecology also do multi-agent modeling. This kind of modelling can give interesting results like, what will the next paper be about? Or what kind of collaborators can we have for a particular paper? These insights can be used to maximize our benefits in terms of game theory and modern perspective.

One of the many things that such networks are used for is trying to predict their evolution overtime, by looking at it snapshot by snapshot. Interpolating the snapshots will give us a prediction about the growth of the network and also explain the parameters of their growth. These networks also give a lot of information about their asymptotic properties, like what would be the properties of the network if it were infinitely large. An interesting property of this network in this regard is that, while keeping the density of the network constant if we keep on increasing the number of edges, we will get a connected network, i.e. the network will consist of only one component, which is a surprising property of this network.

There are many other properties, which is why these networks are important. Reasoning about these properties give useful insights. If we let the size of the network to go to infinity, meaning we are dealing with a network of internet, that is basically one component. Another network of a similar type is the small world network. This network tells us that the degrees of separation between any two people in the world is 6, i.e. any two people in the world are connected with atmost 6 people in between. We have a regular connected network which is like an abstraction. That for example that people are connected with one another through their physical distances. A person has a higher probability of being connected to then people within, for example, 10 miles away from himself, or in a social structure a person is more likely to be connected with another person working in the same organization.

This kind of regularly connected network, or regular graphs of any dimension are just abstractions of such ideas. The way to form this idea is to randomly rewire the edges. Take all edges from one person to another person in a grid. And then randomly rewire this into another node randomly sampled in a whole network. In this manner, given any two nodes, we will use those randomly rewired edges to go very quickly from one place to another, and use the regular edges, the green edges to navigate at the nearby locations, and that is one way to quickly get from one person to another person. The average distance between any two people grows as the logarithm of network size. It is the size of our social network.

The interesting thing is, those kinds of things and those kinds of properties and other properties that where mentioned can be either found out analytically if there is an analytical solution, or can be found through simulation. What we have here is this nice property about average path length, and also in this kind of network we have very large cluster coefficient. Meaning that if A knows B, and B knows C, then A knows C. In this world we have the nodes of forming different clusters.

Erdős–Rényi model

Watts-Strogatz model

Barabasi-Albert model model

What we will talk next is about this “Game Theory” perspective of a dynamic random network. We have a group of people and each person has certain properties, which can be considered as their resources. They might choose to be connected to another person or not connect to another person, or dissolve a previous edge. For each connection there is a cost to maintain this connection. So for each connection there will be a gain, the gain comes from sharing resources and sharing information; there will also be a cost, the cost comes from maintaining the edge. Each person in this network can only know the other people around. The idea is how can different people dynamically choose to connect to other people or not connect to other people over time, and whether or not there will be an equilibrium in the end.

#### Players, Networks, Chains and Components

What we have in this type of network is that we’ll have a set of players which are the nodes, and also we’ll have a set of networks and each network is connected to each other through a chain. This is the formulation of this kind of network:

• Let $N=\{1,…,n\}$ be a set of players
•  Let $\{g:g\subset g^N\}$ be a set of networks, also called graphs, where $g^N=\{ij:i,j \in N,i \neq j\}$ is the complete network and ij is a link
• Let$N(g)={i:\exists j, s.t.\ ij \in g}$ be players involved in at least one link, $n(g)$ be the cardinality of $N(g)$
•  A chain $i_1,i_2,…,in$ in $g$ connecting $i_1$ and in if ${i_1,i_2,…,i_n}\subseteq N(g)$ and ${i_1i_2,i_2i_3,…,i_{n-1}i_n}\subseteq g$
• A nonempty network $g’$ is a component of $g$ if
$i \in N(g’)$ and $j \in N(g’) \Rightarrow a$ connecting $i$ and $j$, and
$(no two players would benefit by forming a new link) • A sequence of adjacent networks$g1=g,g2,…,gK=g’$is an improving path from$g$to$g’$if for any$kgk+1=gk-ij$for some$ij$and$Yi(gk-ij)>Yi(gk)$,(one link is deleted and at least one involved player strictly benefits from deletion) or$gk+1=gk+ij$for some$ij$and$Yi(gk+ij)>Yi(gk)$and$Yj(gk+ij)\geq Yj(gk)$(one link is added and one involved player strictly benefit and the other doesn’t lose) • A set of network$C$is a cycle if there exists an improving path from any$g\in C$to any$g’\in CC$is a maximal cycle if it is not a proper subset of a cycle$C$is a closed cycle if no network$g\in C$leads to an improved network not in$C$. The question is whether or not they will reach one specific absorbing state, meaning that at that state nothing will change. In other words, they will end up with some kind of equilibrium in the sense that the statistics will forget the previous configuration a long time ago and it is something that has a detailed balance. Those are the two different considerations that we have. Similar to what we have coped with in a Markov chain we might end up with a cycle in a sense that we end up with some periodic Markov chain, we go from A to B from B to C and from C to A. Those are the different configurations that we have in terms of network game. One example of this network game (Trading Game) looks like following: Trading Game We have four players, each player can have two goods$(0,1)$or$(1,0)$with probability is$1/2$. The utility of each player is$U(x,y)=x*y$, for x and y amount of different goods (expected utility per player is$1/8$,$1/6$and$3/16$for components$1, 2$and$3$players). The players will tend to either form an edge or dissolve an edge in order to maximize the utilities of themselves. Trade can only happen along the network and the cost of a link is$5/96$per player. We can play this game and we will find out that this network form a cycle, the players will repeatedly try to optimize the utility function, but in the end the greedy choice of the players will just bring the network dynamics in a circular fashion. So this is one example of the network game. Another example of the network game is the Co-author network. In this network each player$i$has$n_i$project. Suppose that each player spends$1/n_i$time on each project equally. Each player can collaborate with another player on exactly one project. It depends on how the player choose to build an edge or to dissolve an edge. When one player has collaboration with another player there is a synergy (utility of collaboration) which is expressed in terms of$u_i(g)=\sum_{j:ij\in g} 1/n_i + 1/n_{j}+1/n_{i}n_j$. The idea of this network is: what is the best way for players to choose the number of projects and to choose the connections of this network in order to maximized the utility or the productivity of different users? One thing that we know is that for any type of network game, either the game will end up with one absorbing state or will end up with a cycle of many absorbing state. Pairwise Stable Network vs. Closed Cycle of Networks Claim: for any$v$and$Y$there exists at least one pairwise stable network or closed cycle of networks. Proof: • Starting network g • Either leads to a pairwise stable network or • Or leads to a cycle C. • We can expand C into a maximal cycle, • Which is closed cycle if no pairwise stable network exists. This one says that for any network valuation function,$v$(the utility of this network), there exists at least one pairwise stable network. Which is just one state, or a closed circle of networks. Players in the network continues to add an edge or remove an edge, and then form a cycle. The idea to show this is actually pretty simple. It starts from a random network and then a player goes one step. Either this state is absorbing state, meaning that nothing will change or this specific network is not absorbing state. So if it not absorbing state we’ll go to another network in this graph. Since there is only a finite number of network configurations in this plot, we will end up with going back to some state. Either the first state or some state in the middle. If we go to some state in the middle, we ultimately get a cycle from the different types of configurations. This is the prove of this one. # Opinion Dynamics, Culture Dynamics & Language Dynamics [10/20] Prepared by Jing Zhang, Bikram Kashyap and Buddhika Laknath Semage This lecture talks about diffusion in social networks. The type of tools that we’re using is simulation and some mathematics. Those are the related literature, the most impactful literature that clipped on diffusion. ### Models and Tools -Claudio Castellano. Santo Fortunato and Vittorio Loreto (2009), Statistical Physics of Social Dynamics, Reviews of Modern Physics 81, 591-646. -Thomas Liggett. [Interacting particle systems](http://catalog.lib.buffalo.edu/vufind/Record/000681209). Vol. 276. Springer Science & Business Media, 2012. -Seth Tisue, and Uri Wilensky. “[Netlogo](https://ccl.northwestern.edu/netlogo/): A simple environment for modeling complexity.” International conference on complex systems. 2004. – N. A. Christakis, & J. H. Fowler (2007). The spread of obesity in a large social network over Centola, D. 2010. The Spread of Behavior in an Online Social Network Experiment. science 329(5996):1194 -Stephen Eubank, et al. “Modeling disease outbreaks in realistic urban social networks.” Nature 429.6988 (2004): 180-184. The first one is a summary of all kinds of models about diffusion. Models about diffusion can explain the formation of language, for example, culture and other type of things. The second one is about the interacting particle systems and the idea is that if we have just one particle and if each particle can take two states, and if we have a large system of many of those not interacting particles, then the system is pretty easy to understand, because it’s just the product dynamics of the individual particles, but if we impose some interaction among those particles, then, the system becomes harder to understand,. Also, because this type of problem maybe very hard, especially if we cope with a smaller system. If we cope up with a larger system, then we could think of what would be the equilibrium behavior and how many equilibrium states there will be. If we let the system evolve from arbitrary initial state, then how many final states there will be and how many attracting states? Those are the attracting states, so given initial state, how can we know what will be the final state of the system up to the evolution? The third one is a simulator. NetLogo is agent-based stimulator. The configuration of NetLogo is a space, a two-world space with grid, so we have cells in our matrix and the cells in the matrix can interact with the neighbors. We can define the neighbors in different ways. They can interact with the neighbors in various ways that can be defined with a program language. Also, we have many practical work on real-world data set and this one, the first one by Christakis and Fowler is perhaps the work that makes people begin to realize that this kind of diffusion behavior actually exists in our real-world network. It is on data set collected about the medical records collected in the New England area in Massachusetts, and it is about more than 30 years of data sets. It’s a data set about patients. Each time a patient visited the clinics, we have a recording about the height, weight, and on the behavior of the patients. Since the patients recommend their relatives and their friends to join the medical network, so we’d have a network, and as we get a huge amount of data then, this is the situation we can talk about science because we have a huge amount of data then. This means that we have either central limit theorem or theorem of mass deviation, so the patterns that we can find may be some pattern that can be generalizable. This is a first, a practical study about the social diffusion. The last study about social diffusion is about the behavior. From the previous studies, we know that the behaviors and opinions will diffuse in real-world networks. One opinion is that the more of my neighbors, the more of my friends take certain opinions, then I will have higher probability of take this opinion. Another competitive idea is that the probability for me to take opinion does not depend on the number of friends that take opinion. It is only based on whether or not I am in the influence of opinions. ### Voter Model –$N=\{1,\dots,n\}$are nodes –$E\subseteq 2^N$are edges –$X_i(0)=\pm 1$for$i\in N$Interate over$t$: Choose j randomly from$\{j: (i,j)\in E\}$with probability$1 \big / \sum_{(i,j)\in E} 1$, and$X_i(t+1)\leftarrow X_j(t)$The first model that we want to talk about is what people call a voter model. A voter model works in the following way. We have a node and we have an edges on those nodes. We could have N times N directed edges, and each edge could exist or not exist. This is why we have 2 to the N number of potential edges. Each node could take one of two states, either +1 or -1, so they are voting. They are either voting for +1 or voting for -1. This is the set-up of our work. We have nodes and we have edges, and the nodes could take either +1 or -1, and the dynamics of this voter model works in the following way: we iterate through time, at any time, we randomly pick a node and after we random pick a note, we randomly pick one of his neighbor, it could be any neighbor, uniformly, and then, we change our opinion to the opinion of our neighbor. These kind of model captures the behavior that a person, the opinions, the randomness spread from one person to another person. #### Modifications: – Adding “zealots” who do not change opinions – 3 States per agent, leftists, centrists and rightists interactions involve only centrists leftists and rightists do not talk to each other. – Introducing memory — Changing opinion only after encouraging opposite opinion$\ge r$times Modifications of these models could be done to capture a more advanced behavior include, for example, we have a certain amount of zealots in a system and those zealots do not change opinions. This could be true, right? The idea is how does the zealot affect the behavior of this network? Another modification could be that we have, instead of two states for each node, we have three states. We could have each agent, each node could be a leftist or a rightist, or a centrist. The leftist could bring a centrist to the left side and the rightist could bring the centrist to the right side, but the leftist never interact with the rightist, and the rightist never act with the leftist. Again, the idea is what will be the dynamics of this model. Those are the ways that have some impact. This means that those ways have been adopted by many persons in their studies. Another way is for us to introduce our memory. This means that we only change, a person in the network only changes his or her opinion after he or she has already heard of this opinion for many times, heard of either +1 or -1. In the example that we mention, it could be that a certain amount of people have already told him about the new opinion. The questions that we want to ask in such kind of models as, for example, whether or not all people in this system will agree on on a certain stand meaning, whether the system will cover still either all +1 or all -1, or something. This is one thing that we could ask. If it does not happen, the next question is whether or not this system will ultimately get to some stationary state after some time, or this system will be evolving randomly or purely adequate. This is the second question that we could ask. The third question that we could ask is that whether this system will become more, will evolve, will form more order or less order will be more chaotic or less chaotic. Those are all very good questions that we could ask. If we think about the book on interacting particle systems, the hardest thing is first, how can we formulate this problem mathematically? After we formulate this program mathematically, how can we solve those problems? Those problems, although they sound pretty easy and intuitive, but they are actually pretty hard. Constantly, the voter model, what we found is that starting from a chaotic state. The system will evolve and it ultimately reaches some state with some stationary, with some order there. This is a voting model written by NetLogo [http://ccl.northwestern.edu/netlogo/models/Voting], and so we can actually conduct some experiments. If we evolve this voter model, will ultimately gets to some stationary state and in different scenarios we’ll get different conclusions, and in some scenarios, we will be able to get a consensus in this system. Another model that we want to talk about is called this bounded confidence model. Again, we have a system, a network of nodes or people. In this network, each people could have opinion, and this opinion is a number randomly distributed between zero and 1. Bounded Confidence Models (http://ccl.northwestern.edu/netlogo/models/community/bc) People only talk to those with similar opinions –$N=\{1,\dots,n\}$are nodes –$E\subseteq 2^N$are edges –$X_i(0) = x_i(0) \in [0,1]$Interate over$t$: Choose j randomly from$\{j: (i,j)\in E\}$if$|x_i(t)-x_j(t)|\ge\epsilon$nothing happens at time$t+1$, otherwise –$X_i(t+1) = X_i(t)+\mu\cdot \left(x_j(t)-x_i(t)\right)$–$X_j(t+1) = X_j(t)+\mu\cdot \left(x_i(t)-x_j(t)\right)$Properties of Bounded Confidence Models Number of clusters is approximately$1 / (2\epsilon)$, because at stationarity, people in different clusters cannot talk to each other This model evolves in the following way. At each time we iterate over time$T$, at each time$T,$we randomly choose a pair of nodes,$I $and$ J,$and if the opinions of$ I$and$J$is greater than some threshold then, it will not influence one another. However, if the opinions of those two people are within some SigmaXL, then they will change their opinion, convert their opinion according to the opinion of other people. This is the bounded confidence model. Again, the question that we want to ask about this bounded confidence model is that what will be the distribution of the opinions ultimately in this system? Those are mathematical problems, but the first is, to start thinking about this model is to first proceed with some stimulation and find out what will be the result. NetLogo User Community Models [http://ccl.northwestern.edu/netlogo/models/community/bc]. At each time step, we randomly select two people among all peoples, and if the opinions of those two people are too far away, then, those people do not change opinions, but if the opinions of those people are close enough, then their opinions will move closer to each other. Then, this means that if the opinions of any two persons are close enough, then, their opinions will converge. So, as time goes on, there will be certain attractors in the system, and those attractors will collect together the nearby opinions. That is why we ultimately get some polarized opinions in this system. Majority Rule Models A group of$r$agents is selected at random, and all agents in the group take the majority opinion.$\Rightarrow$System could have concensus$E(X)=\pm 1$, unstable state and metastable state. [Models based on Social Impact Theory] (http://ccl.northwestern.edu/netlogo/models/community/Sitsim) –$N=\{1,\dots,n\}$are nodes –$E\subseteq 2^N$are edges –$X_i(0)=\pm 1$for$i\in N$Agents change opinions according to the opinions of distance, persuasiveness, and supportiveness of other agents. The other kind of model about the opinions is that, again, here, we have the agents. They are in a network, so they have distances, and the distances could be defined in various ways such as the number of hops, one node to another node, or if the network, the distance is defined in terms of distance, then we can find out a circle and find out the groups within the circle. What we do in this majority role model is that we have a system of people and each people could take one of two opinions, either plus or minus. Each two people have a distance. At each time step, we will random pick a person, and then, randomly pick a radius among this person. This is the group that we want to consider. Then, we change the opinions of all of those persons into the majority of their opinions. This is how this model works. Of course, we can also work with a more complicated model, and the beauty of a simple model is that a simple model will be more generalizable to other situations, and if we have a lot of constraints about a model, then the generalizability of the model is limited. Based on the social impact theory, an agent change opinions according to opinions of the distance such as in the majority role or could also be based on the persuasiveness of the other people, or based on the supportiveness of the other agents. The agent could persuade another agent. An agent could pursue another agent to change opinion or could support the same opinion of another agent. This is the idea. In a social network, we normally have. It’s normally a small world, cluster network. We have clusters and in the clusters, people are cohesively, tightly connected, and between the clusters, we have wear ties. There are also gateway people and the gateway people tends to propagate the ideas from one cluster to another cluster. As social impact theory actually captures this type of dynamics. Another book which I haven’t listed here because it’s just a pop science book. I think it’s called Structural Hole. It’s also a pretty interesting book about the dynamics of social networks and the diffusion of information and ideas within a social network. We have talked about the opinion dynamics. The next topic that I want to talk about is culture dynamics. Culture dynamics is actually not very different from opinion dynamics. The only difference is that in opinion dynamics, each person can have one dimension which is opinion about something. It could be the + or 1, it could be a range, but in culture dynamics, we have a culture is composed of multiple opinions. We could have opinion one to opinion N, and each opinion could take values from a range. It could be the +1, -1 could be within a range. ### Cultural Dynamics – What are the microscopic mechanisms that drive the formation of cultural domains? – What is the ultimate fate of diversity? Is it bound to persist, or all differences will eventually disappear? – What is the role of social network structure? In cultural dynamics models, we examine the evolution interaction of different cultural domains and groups and try to reason their eventual fate. Whether the cultures will combine, or if they will consolidate. What factors are responsible. What are the roles of social network structure? Axelrod Model [http://ccl.northwestern.edu/netlogo/models/community/Dissemination%20of%20Culture] Individuals in a network have –$F$features$\left(\sigma_1,\sigma_2,\dots,\sigma_F\right)$, –$q$traits per feature$\sigma_f=0,1,\dots,q$for$f=1,\dots,F$At each step #. an individual$i$and its neighbour$j$are selected, #. their similarity is estimated$w_{i,j}={1\over F}\sum_{f=1}^F 1_{\sigma_f(i),\sigma_f(j)}$social influence: tendency for individuals to become more similar when they interact #. with probability$w_{i,j}$,$\sigma_f(j)\leftarrow\sigma_f(i)$for$f$s.t. ($\sigma_f(i)\ne\sigma_f(j)$) homophily: tendency for the likes to interact more Axelrod, R., 1997, J. Conflict Resolut. 41(2), 203. Properties of Axelrod Model starting from disordered initial condition, the system will evolved into – the ordered state, in which all individuals share the same traits for all features, or – one of frozen states, with coexistence of different cultural regions. – System will more likely to evolve into many cultural regions if each feature has more traits. At each step, an individual i and its neighbor j are selected, and their similarity is calculated using$w_{i,j}={1\over F}\sum_{f=1}^F 1_{\sigma_f(i),\sigma_f(j)}$. That is we are calculating the similarities based on agreement of the features. If feature f of individuals i and j agrees, then their similarity weight increases, and if the feature f is different, then it has no contribution. Now, with probability wi,j, we change the features of individual j by the features of individual i. Set$\sigma_f(i)=\sigma_f(j)$, with probability$w_{i,j}$for all f such that$\sigma_f(i)\ne\sigma_f(j)$. Repeated application of this algorithm can be used to simulate the evolution of culture groups. Starting with a random distribution of features and the traits, we can run the algorithm and end up with three different kind of evolution: ● All converge into one culture, with same features. ● There will different cultural regions coexisting, in a frozen state. There will be clusters of features for the different cultural regions. ● If each feature has a lot of traits, convergence becomes hard. So we will end up with lot of small clusters, that is, cultural regions. If we try to use tweets to simulate the system, with a small number of tweets per culture, the system will consolidate the people in the culture. With a high number of tweets, this becomes much harder. In a system of people, we have many features, and many tweets for each features, that is, the tweets can be considered traits. With high count of tweets, we end up in the third situation as described above. It is easy to simulate the system programmatically. But getting an mathematical solution is hard. There are entire groups of people working on finding mathematical answers. ### Flock Dynamics Each particle$i$is characterized by its – position$x_i$, – direction$\theta$, and – constant modulus of speed$v_0$At each time step,$\theta_i(t+\Delta t)=\theta(t) _{S(i)}+\xi$, where$\xi\sim U(-\eta,\eta)$particle$i$assumes average direction of particles nearby, with some random perturbation A. Czirok, A.-L. Barabasi, and T. Vicsek, 1999, Phys. Rev. Lett. 82(1), 209 This kind on dynamics model is based on movement of birds and other animals like fish that moves in large groups. What we observe is birds moving in flocks, where a huge group moves together. Each particle/member has 3 properties: ● Position$x_i$● DIrection$\theta$● Constant speed modulus$v_0$At each time step, each particle assumes average direction of nearby particles, with some random deviation. With the passage of time, it is quite obvious that we will end up with clusters of particles moving in the same direction. Also if the particles are too fast, and the neighbourhood radius is small, then the interaction between particles will be very weak, and clusters will not form. If the particles are fast, and the neighbourhood is not to small, we can end up with just a single large cluster. NetLogo has a simulator where we can play with different parameters and simulate the system. The flock model might be too complex to be analytically analyzed, and there may not be any solid work on the analytical solution of the ultimate behaviour of the flock. Flock Dynamics Simulation http://ccl.northwestern.edu/netlogo/models/Flocking Properties of Flock Dynamics Order parameter$\Phi={1\over N}\left|\sum_{j=1}^N \vec v_j\right|$varies with noise level$\eta$:$\Phi(\eta)=\begin{cases}
\left[\frac{\eta_{c}(\rho)-\eta}{\eta_{c}(\rho)}\right]^{\beta} & \eta\eta_{c}(\rho)
\end{cases}$where –$\rho$is the density of particles –$\eta_c$is threshold noise for system to evolve into an ordered state The order parameter$\Phi$varies with the noise$\eta$. If the noise level is over a threshold noise level$\eta_c$, then we will have chaotic behaviour, and the flocks will not form. If the noise is less than the critical noise level, then we will end up with consolidated behaviour. The lower the noise, the more coordinated the members in the flock will be. ### Language Dynamics – Processes explaining the emergence, change, evolution, interactions and extinction of languages Sociobiological approach: Successful communicators, enjoying a selective advantage, are more likely to reproduce than worse communicators. Sociocultural approach: Agents can select better strategies exploiting cultural choice and direct feedback from the system. Language Dynamics model is an attempt at exploring behaviors of human languages. With this model it’s possible to simulate events such as formation of languages, evolution and the outcome. There are two schools of thoughts when it comes to language dynamics. 1. Sociobiological approach This approach takes the position that successful communicators enjoy a selective advantage because they can better communicate with each other and they can better coordinate their behavior. It’s based on the hypothesis that social behavior has resulted from evolution and attempts to explain and examine social behavior within that context [1][3]. Applying the phenomena of natural selection in evolution theory, successful communicators are more likely to reproduce than their less successful communicators and become the more prominent trait. 2. Sociocultural approach This approach is based on the concept that human activities take place in cultural contexts, are mediated by language and other symbol systems, and can be best understood when investigated in their historical development [2]. Because of the interdependence of social and individual processes, individual agents can set up better strategies by exploiting culture choice and feedback from the system. #### Naming Game There is a vocabulary of words. Initially all agents in a network have empty memory. Iterate over the following steps until convergence. 1. A pair of neighbor agents are chosen, one as speaker and the other as listener. 2. If the speaker has empty memory, he will generate a new word from the vocabulary. Speaker randomly selects a word from his memory and tell it to the listener. 3. If the listener has the word, he will remove all other words and keep only this one. Otherwise, the listener will remember this word. Baronchelli, A., Felici, M., Loreto, V., Caglioti, E. & Steels, L. Sharp transition towards shared vocabularies in multi-agent systems. J. Stat. Mech. 2006, P06014 (2006). Naming game is a quite well-known interpretation of the language dynamics. It’s a model able to account for the emergence of a shared vocabulary of form-meaning associations through social/cultural learning [4]. It works in the following way. There’s a vocabulary of words, a network and nodes, which are the agents. Initially, the agents have an empty memory. Whenever they encounter something, they will name them and will communicate the name with other agents. At each time, we will randomly choose a pair of neighboring agents as the pair of agents who needs to communicate with each other. We specify one agent as a speaker and another agent as a listener. If the speaker has an empty memory, he would generate a new word from the vocabulary. And if the listener has the word in his vocabulary, then the speaker and the listener have both consolidated the word. The interaction is considered a success, and both agents delete from their inventories all their names except for the agreed name. But if the name chosen by the speaker is not in the memory of the listener, he will think of the word as a new word, and will add the word to his memory (ie: remember the word). Properties of Naming Game Starting from random choice of words by networked individuals, global consensus will eventually emerge. –$N_w(t)$, total number of words in the population –$N_d(t)$, total number of different words in the population –$S(t)$, success rate Now that we have established the rules of the game, let’s look at the dynamics. What has been observed in this model is that during the exploitation stage, the total number of active words in the system will first increase. Then it will decrease and the total number of different words will also decrease. But the successful rate of this system will gradually increase. #### Evolution Language Game –$n$objects –$m$words production matrix$p_{i,j}$probability for speaker to use word$j$when seeing object$i$comprehension matrix$q_{i,j}$probability for lisener to associate word$j$to object$i$language$L=(P,Q)$fitness$F(L_1,L_2)={1\over 2}\sum_{i,j} p_{i,j}^{(1)}q_{j,i}^{(2)}+p_{i,j}^{(2)}q_{j,i}^{(1)}$associate matrix$a_{i,j}$number of times parent associate object$i$with word$j$–$p_{i,j}=a_{i,j}/\sum_l a_{i,l}$–$q_{i,j}=a_{i,j}/\sum_l a_{l,j}$In order for a set of agents in a network to form a language, the set of agents should choose some strategy to be able to think in an understandable way through the above naming game. This is the idea behind evolution of language game. There are N objects and M words. We want to use the M words to represent the N objects. There are two matrices. 1. Production matrix – This is at the speaker’s side. Each row represents an object and each column represent a word. Whenever we see an object, we have a distribution of the words, which represents the likelihood of the speaker to choose certain words. An entry in this matrix –$p_{i,j}$¬ represents the probability for speaker to use word j when explaining the object i. 2. Comprehension matrix – This is at the listener’s side. Whenever a listener gets a word, he will map the word to a set of objects in the system. An entry in the matrix –$q_{i,j}$represents the probability for the listener to associate word j to object i. The fitness of this system will be represented as the probability for a pair of listener and the speaker to understand one another in both ways.$F(L_1,L_2) = \frac{1}{2}\Sigma_{i,j} p_{i,q}^{(1)}q_{j,i}^{(2)} + p_{i,q}^{(2)}q_{j,i}^{(1)}$The first term is the average success rate for person J to understand person I. The second term is the average success rate for person I to understand person J. The combination of two terms gives the fitness of this system. This system evolves in a way that the parents will have kids in a genetic model and the kids will take the production matrix or the comprehension matrix from the parents with some random permutation. In the end, our goal is to evolve this language so that the rate — the fitness of this system increases. ### References [1] https://en.wikipedia.org/wiki/Sociobiology [2] http://webpages.charter.net/schmolze1/vygotsky/johnsteiner.html [3] http://plato.stanford.edu/entries/sociobiology/ [4] http://openaccess.city.ac.uk/2659/1/In-depth%20analysis%20of%20the%20Naming%20Game%20dynamics.pdf # Projects & Project ideas In this post I collect together projects and ideas from the class and from myself. My criteria is that (a) a project should be a good complement to the content of this course, (b) can contribute to your career development, (c) are well defined and can be well managed and (d) can be finished in 4 weeks with 5 hours per week. • Review data sets that track the locations and proximity of people with mobile phones and other types of sensors: Where can those data sets be downloaded? What are the works on each data set? What are the interesting findings, innovative methodology and significant applications of these data sets? Sum of 10-based logarithm of the data size (in bytes) >100, references > 50 from most cited works to least cited works. Tell a cohesive story about these data sets. • Review social media data sets: Where can those data sets be downloaded? What are the works on each data set? What are the interesting findings, innovative methodology and significant applications of these data sets? Sum of 10-based logarithm of the data size (in bytes) >100, references > 50 from most cited works to least cited works. Tell a cohesive story about these data sets. • Review one of the open source simulation softwares: Demonstrate how to use the software by walking through one example, show the architecture of the software (modules and their interactions, entry points, key modules, etc.), show how to dump events or system events from the software. Those software could include UrbanSim, MATSim, TranSIMS, MetroPolis, STEM, etc. • Define the dynamics of network evolution through a set of events, simulate network evolution, make inferences from observations using either variation method or MCMC. Hint: we can follow the examples in the problem sets, work with fewer events, no need to work with real world data. • Define the dynamics of diffusion through a set of events, simulate network evolution, make inferences from observations using either variation method or MCMC. Hint we can follow the examples in the problem sets. Hint: we can follow the examples in the problem sets, work with fewer events, no need to work with real world data. • Identify a real world data, postulate the dynamics, and conduct inference using one of the algorithms discussed in the lectures. • more to come…. # Gibbs Sampling and Metropolis-Hastings Sampling [10/06] Prepared by Aayush Uppal, Madhur Gupta and Nishchita Purushothama Patel ### Gibbs Sampling #### Introduction to Monte Carlo Sampling Gibbs Sampling is a Sampling technique, So the first fundamental question is, Why Sampling ? Gibbs Sampling allows us to sample from a distribution that asymptotically follows P(π|X) (Prior information) without having to explicitly calculate the integrals. Gibbs Sampling is an instance of a Markov Chain Monte Carlo technique. Monte Carlo methods are algorithms that help you obtain a desired value by performing simulations involving probabilistic choices. Understanding Monte Carlo technique with a simple example to estimate value of$\pi$Draw a perfect square on the ground. Inscribe a circle in it i.e. the circle and the square are centered in exactly the same place, and the circle’s diameter has length identical to the side of the square. Now take a bag of rice, and scatter the grains uniformly at random inside the square. Finally, count the total number of grains of rice inside the circle (call that C), and inside the square (call that S). Therefore we can say that:$\frac{C}{S} \approx \frac{\pi(\frac{d}{2})^2}{d^2}$. Solving for$\pi$, we get$\pi \approx \frac{4C}{S}$We just solved a problem by approximating the values of integrals. The more grains of rice we use, the better our approximation will be. Understanding the Markov Chains aspect In the above example the distribution was uniform, now we will assume that the distribution is not uniform, also we are interested in$E_{p(x)} [f(x)]$The expected value is represented as:$E[f(z)] = \int f(z)p(z)dz$. Conceptually, the integral in equation sums up f(z)p(z) over infinitely many values of z. But rather than touching each point in the sum exactly once, if you sample N points z(0) , z(1) , z(2) , . . . , z(N) at random from the probability density p(z), then.$E_{p(z)}[f(z)] = \lim_{N\to\infty} \frac{1}{N}\sum \limits_{n=1}^{N} f(z^{(t)})$Looking at this equation, it’s clear that we can get an approximate value by sampling only a finite number of times, T:$E_{p(z)}[f(z)] \approx \frac{1}{T}\sum \limits_{t=1}^{T} f(z^{(t)})$So, now we have a way to approximate the integral. The remaining problem is this: how do we sample$z^{(0)}, z^{(1)}, z^{(2)}, \cdots , z^{(T)}$according to$p(z)$. For our purposes, though, the key idea is to think of z’s as points in a state space, and find ways to “walk around” the space — going from$z^{(0)}$to$z^{(1)}$to$z^{(2)}$and so forth — so that the likelihood of visiting any point z is proportional to p(z). Here g is a function that makes a probabilistic choice about what state to go to next according to an explicit or implicit transition probability$P_{trans}(z^{(t +1)} | z^{(0)}, z^{(1)}, \cdots , z^{(t )})$Now what will make it a Markov Chain Monte Carlo technique is defining things so that the next state you visit,$z^{(t+1)}$, depends only on the current state$z^{(t)}$. i.e:$P_{trans}(z^{(t+1)}|z^{(0)}, z^{(1)},\cdots,z^{(t)}) = P_{trans}(z^{(t+1)}|z^{(t)})$The heart of Markov Chain Monte Carlo methods is designing g s o that the probability of visiting a state z will turn out to b e p(z), as desired. Gibbs sampling is one algorithm that meets the conditions. #### The Gibbs sampling algorithm Gibbs sampling is applicable in situations where Z has at least two dimensions, i.e. each point z is really z = <z1 , . . . , zk>, with k > 1. The basic idea in Gibbs sampling is that, rather than probabilistically picking the next state all at once, you make a separate probabilistic choice for each of the k dimensions, where each choice depends on the other k-1 dimensions.$z^{(0)} := <z_1^{(0)},…,z_k^{(0)}>$for t = 1 to T do for i = 1 to k do$z_i^{(t+1)} \approx P( Z_i | z_1^{(t+1)},..,z_{i-1}^{(t+1)} ,z_{i+1}^{(t)}….,,z_{k}^{(t)})$end for end for The distribution we are sampling from by using the definition of conditional probability :$P( Z_i | z_1^{(t+1)},..,z_{i-1}^{(t+1)} ,z_{i+1}^{(t)}….,,z_{k}^{(t)}) = \frac{P(z_1^{(t+1)},..,z_{i-1}^{(t+1)} ,z_{i}^{(t)},z_{i+1}^{(t)}….,,z_{k}^{(t)})}{P(z_1^{(t+1)},..,z_{i-1}^{(t+1)},z_{i+1}^{(t)}….,,z_{k}^{(t)})}$#### Gibbs Sampler Stationary Distribution The Gibbs sampler has been used extensively in the statistics literature. It relies on iteratively sampling from a set compatible conditional distributions and the sampler is known to converge to a Stationary distribution. The Markov Chain generated by Gibbs Sampler converges to the target density as the number of iterations become large. P(X) is the stationary distribution of the Gibbs sampler, i.e.$P(\phi) = \int P(\phi| X)P(X)dX = \int P(X, \phi) dX. The reason is the following. \begin{align} \int P(\phi| X)P(X)dX = & \int d X_1 d X_2 P(\phi_1|X_2) P(\phi_2|\phi_1)P(X_1,X_2) \\ = & P(\phi_2|\phi_1) \int d X_2 P(\phi_1|X_2) \int d X_1 P(X_1,X_2) \\ = & P(\phi_2|\phi_1) \int d X_2 P(\phi_1|X_2) P(X_2) \\ = & P(\phi_2|\phi_1) P(\phi_1) \\ = & p(\phi) \\ \end{align} #### Detailed Balance (Principle of detailed balance) Transition probability matrix P and probability distribution\pi$are said to be in detailed balance, or equivalently, the principle of detailed balance is said to hold, if:$\pi_i p_{ij} = pi_j p_{ji} \forall_{i,j} \epsilon SUltimately if the detailed balance equations are satisfied, it implies the markov chain is reversible and it has an invariant distribution. Gibbs sampler has detailed balance if it scans through components in fixed order, then scans through components in reverse order \begin{align} P(X)P(\phi|X) = & P(X) \int d X’_1 P(X’_1 |X_2) P(\phi_2|X’_1) P(\phi_1|\phi_2)\\ = & P(X) P(\phi_1|\phi_2) \int d X’_1 P(X’_1 |X_2) P(\phi_2|X’_1)\\ = & P(X_1|X_2)P(\phi_1|\phi_2) \int d X’_1 P(X’_1) P(X_2|X’_1) P(\phi_2|X’_1)\\ = & P(\phi)P(X|\phi) {\rm~by~symmetry} \end{align} Gibbs Algorithm is a special case of Metropolis-Hastings Algorithm,i.e Metropolis is a generalization for sampling . Metropolis sampling algorithm can draw samples from a distribution of complex random VariableX$.$X$could be a stochastic process or e.g. a process of formation of random network model or whole social diffusion process. ### Metropolis-Hastings Sampling In order to work with Metropolis-Hastings algorithm our only requirement is that we should be able to write down the$P(X)$up to some scaling factor. The Metropolis algorithm first proposes a possible new state$X^{*}$in the Markov chain, based on a previous state$X^{(t-1)}$, according to the proposal distribution$q(X^{*} | X^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the the target distribution$P(X)$evaluated at$X^{*}$. One constraint of the Metropolis sampler is that the proposal distribution$q(X^{*} | X^{(t-1)})$must be symmetric. The constraint originates from using a Markov Chain to draw samples: a necessary condition for drawing from a Markov chain’s stationary distribution is that at any given point in time t, the probability of moving from$X^{(t-1)}\rightarrowX^{(t)}$must be equal to the probability of moving from$X^{(t-1)}\rightarrowX^{(t)}$, a condition known as reversibility or detailed balance. Metropolis-Hastings algorithm implements an additional correction factor \textit{c}, defined from the proposal distribution as $$c = \frac{q(X^{t-1} | X^{(*)})}{q(X^{*} | X^{(t-1)})}$$ The correction factor adjusts the transition operator to ensure that probability of moving from$q(X^{*} | X^{(t-1)})$is equal to the probability of moving from$q(X^{*} | X^{(t-1)})$, no matter the proposal distribution. The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of acceptance probability$\alpha$. Specifically, to draw$M$samples using the Metropolis-Hastings sampler: 1. set$t$= 0, generate an initial state$X^{(0)}\sim\pi^{(0)}$2. repeat until$t$=$M$3. set$t$=$t$+1 4. generate a proposal state$X^{*}$from$q(X | X^{(t-1)})$5. calculate the proposal correction factor$c = \frac{q(X^{t-1} | X^{(*)})}{q(X^{*} | X^{(t-1)})}$6. calculate the acceptance probability$\alpha = \min \left(1,\frac{P(X^{*})}{P(X^{(t-1})}\times c \right )$7. draw a random number$u$from$Unif(0,1)$1. if$u \leq \alpha $accept the proposal state$X^{*}$and set$X^{(t)} = X^{*}$2. else set$X^{(t)} = X^{(t-1)}$Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler. #### Metropolis-Hastings Sampling — Example mh.gamma <- function(n.sims, start, burnin, cand.sd, shape, rate) { draws <- start theta.update <- function(theta.cur, shape, rate) { theta.can <- rnorm(1, mean = theta.cur, sd = cand.sd) accept.prob <- dgamma(theta.can, shape = shape, rate = rate)/dgamma(theta.cur, shape = shape, rate = rate) if (runif(1) <= accept.prob) theta.can else theta.cur } for (i in 2:n.sims) draws = c(draws, theta.update(draws[i-1], shape = shape, rate = rate)) return(draws[(burnin + 1):n.sims]) } mh.draws <- mh.gamma(10000, start = 1, burnin = 1000, cand.sd = 2, shape = 2, rate = 4) plot(density(mh.draws)) lines(seq(0,2,length.out = 100), dgamma(seq(0,2,length.out = 100), shape=2, rate=4), col='red') legend('topright',lty=1, col=c('black','red'), legend=c('M-H','gamma(shape=2,rate=4)'))  #### Detailed Balance of Metropolis-Hastings algorithm$P(\phi|X) = q(X,\phi)\alpha(X, \phi)$, if$X\ne\phiP(X|X) = 1-\int d\phi q(X,\phi)\alpha(X,\phi)Hence \begin{align} & P({Y\le\phi}|X) = \int_{-\infty}^\phi dY q(X,Y)\alpha(X,Y) + 1(\phi\ge X)\left(1-\int dY q(X,Y)\alpha(X,Y)\right)\\ \Rightarrow & p(X, \phi) = {d P({Y\le\phi}|X) \over d\phi} = q(X,\phi)\alpha(X,\phi) + \delta(X-\phi)\left(1- \int dY q(X,Y)\alpha(X,Y)\right)\\ \Rightarrow & P(X)p(X,\phi) = P(X) q(X,\phi) \min \left(1,{P(\phi) q(\phi, X) \over P(X) q(X, \phi)} \right) \\ & +\delta(X-\phi)\left(P(X)-\int dY P(X) q(X,Y) \min \left(1,{P(Y) q(Y, X) \over P(X) q(X, Y)} \right) \right)\\ \Rightarrow & P(X)p(X,\phi) = \min \left(P(X) q(X,\phi),P(\phi) q(\phi, X) \right) +\delta(X-\phi)\left(P(X)-\int\min dY \left(P(X) q(X,Y), P(Y) q(Y, X) \right) \right)\\ & =P(\phi)p(\phi, X) {\rm~by~symmetry} \end{align} ### Gibbs Sampling from Hidden Markov Models A hidden markov model is defined in terms of the following 3 latent states:X(t)\in \{1,\cdots,k\}$, State transition kernel$p_{ij} = P(X_{t+1}=j|X_t=i)$and the initial state$X_0$, and Observation model$P(Y_t|X_t)$One simple way to sample from the hidden Markov model is to iterate through every sample$X_t$for$t = 1,\cdots,T$. Here we fix everything else and write down the probability of$X_tas given below: \begin{align*} & P(X_t | X_1,\cdots,X_{t-1},X_{t+1},\cdots,X_T,Y_1,\cdots,Y_T)\\ =& (P(X_1,\cdots,X_T,Y_1,\cdots,Y_T))⁄(\sum_{X_t} P(X_1,\cdots,X_T,Y_1,\cdots,Y_T))\\ \end{align*} However in order to make an efficient sampling algorithm it has to satisfy the following 2 conditions: • Sample points should diffuse quickly inside the sample space. • The equilibrium distribution of the sample should be very similar to the distribution that is desired. However the problem with the above sampling algorithms is that the points will diffuse slowly. This is because we fix the previous values and the next values at every state. This means although the above algorithm works, it takes infinite amount of time for it to ultimately mix well to be able to explore the whole probability space. To avoid this, we use the following algorithm: Forward Filtering Backward Sampling Algorithm 1. We first conduct the forward sweep and extract the forward statistic\alpha_t(X_t)= P(X_t|Y_1,\cdots,Y_t)$for$t=1,2,\cdots$2. Next step is the backward sampling stage. Here, we sample$X_T$first and so on until$X_0\begin{align*} & P(X_{t-1}│X_t,Y_1,\cdots,Y_T)\\ =& \frac{P(X_{t-1},X_t,Y_1,\cdots,Y_T)}{P(X_t,Y_1,\cdots,Y_T)}\\ =& \frac{P(X_{t-1},Y_1,\cdots,Y_{t-1})P(X_t|X_{t-1})P(Y_t,\cdots,Y_T|X_t)}{\sum_{X_{t-1}} P(X_{t-1},Y_1,\cdots,Y_{t-1})P(X_t|X_{t-1})P(Y_t,\cdots,Y_T|X_t)}\\ =& \frac{\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})P(Y_t|X_t)\beta_t(x_t)}{\alpha_{t-1}(X_{t-1})P(X_t|X_{t-1})P(Y_t|X_t)\beta_t(x_t)} \end{align*} # Sampling Multivariates and Processes [10/08] Prepared by Zhu Jin, Srinivasan Rajappa & Yuhui Wu ### Stochastic Kinetic Model Stochastic Kinetic Model is a highly multivariate jump process used to model chemical reaction networks, particularly those in biochemical and cellular systems. This kind of model represents temporal evolution of a system among M populations interacting with one another through a set ofv$events. • V events happen with rates$h_k(x_t,c_k)$where$k=1,…, v$.$c_k$is rate constant ($k$is a index of events) and$x_t$represents M population • Each event k changes population by$Δ_t$The Gillespie algorithm is as following: 1. At time$t=0$, the system is initialized into the initial state:$(x(1),⋯x(M))←(x(1)0,⋯x(M)0)$. 2. Sample time to the next event no matter what event it is:$τ∼Exponential(h_0(x,c) = \sum_{k=1}^v h_k(x,c_k))$. 3. sample the specific event on condition that there is event happening (the minimum of a set of exponential distributions is exponential distribution with the rate which is the sum of the event rates of those individual distributions. On condition that we have event which happens at the minimum of the times of the distribution then we can sample what specific event is happening), and this is a categorical distribution:$k∼categorical(h_1h_0,…,h_vh_0)$. 4. Update time and the states of the populations:$t←t+τ, x←x+Δk$. 5. Repeat steps 2-5 until the termination condition is satisfied. The reason that this algorithm sample many events, take the minimum and then move on is that the exponential distribution is used, which is a memory-less distribution, and at any time of the next event, there is no memory about what was done in the past. This kind of models that involves different populations and that involves a set of events. It is common in both examples that involves a set of events. The events happen with different event rates and the rate involves a different rate constants. The rates also involve the current state of the system. The state of the system is expressed in terms of the different populations. Susceptible-infectious-susceptible dynamics: •$S+I→2I, rate = \alpha\cdot |S|\cdot |I|$•$I→S, rate = \beta\cdot |I|$For example, there are two different populations in susceptible dynamics – susceptible population and infectious population. These two populations interact with each other through two events – infection and recovery. The probability of getting infected is$R$when an individual from susceptible population interact with one from infectious population. Since system has$S$number of susceptible individuals and I number of infectious individuals then the total number of combinations that we can have if we choose one susceptible individual and one infectious individual is going to be$S*I$, the susceptible population times the infectious population. The total rate of the whole system is going to be alpha times susceptible population times the infectious population. It depends on how we look at it. If we look at it at the individual level then the rate is alpha but if we look at it at the system level then the rate is going to be larger than alpha. Similarly we have another event called recovery. The same happens here and also we know that after an infection event called the susceptible in the population is decreased by one and the infectious population is increased by one. After the recovery event the infectious population is decreased by one and the susceptible in the population is increased by one. This can actually be calculated from the chemical equations. Like we have before the reaction we have one I and after the reaction we have two Is. This means that I has increased by one. Here before the reaction we have one S and after the reaction we have zero$S$. This means that$S$is decreased by one. The rate is defined per time united and also could be time unit or a function of time. A lot of sampling-based algorithms as has been mentioned, but there is a challenge that condition on the observations, the sampling does not work so easy, because the sampling algorithm does not assume that there are observations. Thus, the Likelihood of Finite State Continuous Time Markov Process with Complete Observation is introduced. ### Sample rate constants from complete path stochastic kinetic dynamics In a system from time$1$to time$T$, there are$n$events happened:$v_1,v_2,…,v_n$. Those events have happened at corresponding times$t_1,t_2,…,t_n$. At each time this event$vi$happened with rate computed from the time of the previous event as $$h_{v_i}(x_{t_{i−1}},c_{v_i})$$ and rate of all events is $$h_0(x{t_{i−1}},c)=∑_k h_k(x_{t_{i−1}},c)$$ changing state from$x_0$at time$t_0=0$to$x_1,x_2,…,x_n.$The state of this system is always$t_{i-1}$, and between$t_{i-1}$to$t_i$the rate of any event to happen is$h_0$which is the sum of the rate for the individual events. The probability for this set of events and the corresponding happening times is going to be the probability distribution that the time to next event is from$t_i$to$t_{i-1}$. So the probability density function is $$h_0(x_{t_{i−1}})exp(−h_0(x_{t_{i−1}},c)⋅(t_i−t_{i−1}))$$ This is a category distribution which suppose that there is an event happening then what is the event that happened? It’s probability is: $$h_{v_i}(x_{t_{i−1}},c_{v_i})h_0(x_{t_{i−1}},c)$$ After the$n$th event and between the time of the$n$th event to this$t$, there is no further event, the probability of this is $$exp(−h_0(x_{t_n},c)⋅(t_T−t_n))$$ Those are the three parts in the likelihood function. If the$h_0$in the pdf function of the next event with$h_0$of the multinomial distribution is cancelled, all of the$h_{v_i}$are left over, as: $$\prod_{i} h_{v_i}(x_{t_{i−1}},c_{v_i})$$ And then if the exponential distributions is merged, there will be$H_0$from time$0$to time$T$: $$exp(−\sum_i h_0(x_{t_i},c)⋅(t_{i+1}−t_i))$$ So in the likelihood there are information about the jumps ($\prod_{i} h_{v_i}(x_{t_{i−1}},c_{v_i})$) and also the nonjumps ($exp(−\sum_i h_0(x_{t_i},c)⋅(t_{i+1}−t_i))$). If $$h_k(x_t,c_k)=c_kg_k(x_t),$$ there is $$L(c;x)=\prod_ic_{v_i}g_{v_i}(x_{t_{i−1}})exp(−\sum_i\sum_kc_kg_k(x_{t_i})⋅(t_{i+1}−t_i))$$$$∝\prod_{k∈events}c^{rk}_k exp(−\sum_{k∈events}c_k∫_t dtg_k(x_t))$$ where $$r_k=\sum_i 1(v_i=k)$$ Hence MLE of$c_k$is the expected number of event$k$to happen in unit time $$c_k=r_k∫_tdtg_k(x_t)$$ ### Sample from a proposal process with piecewise constant/linear event rate An inhomogeneous or non-homogeneous Poisson process is similar to an ordinary Poisson process. The average rate of arrivals$\lambda(t)$varies over time or in other words it can be understood that rate parameter of the process is a function of time. In the example of Predator-Prey model, the population of predator/prey observed over time can be exhibiting a rate similar to$\lambda(t)$. In such situation where random phenomena occurs we can account this to Inhomogeneous-Poisson process. Property of Independent Increments: Suppose that we have a random process$X=\{X_t: t \in T\}$such that index set$T$is either$N$or$[0,\infty)$. Here X has independent increments if for$t_1, t_2, \cdots, t_n \in T$with$t_1\le t_2 \cdots \le t_n$, the increments$X_{t_1}, X_{t_2}-X_{t_1}, \cdots , X_{t_n}-X_{t_n-1}$are all independent. A process that produces random points is a non-homogeneous Poisson process with rate$\lambda(t)$if the counting process$N$satisfies the following properties: 1. If$\{A_i:i \in I \}$is a countable, disjoint collection of measurable subsets of$[0,\infty)$then$\{N(A_i): i \in I\}$is a collection of independent random variables. 2. If$A \subseteq[0,\infty)$is measurable then$N(A)$has the Poisson distribution with parameter m(A). Some other properties include the following: For finding the number of events between$(t,t+dt]$we can apply the same properties of poisson process for$\lambda(t)$. i.e.$N_t(dt) \sim \mbox{Poisson}(\lambda(t) dt)$This property gives an insight that sum of independent Poisson random variables is a Poisson with rate equaling sum of individual rates at any time$t$. Now having said this, in contrast homogeneous Poisson process have event observations really close to one another. If not then this means that the rate cannot be assumed to be a constant. So if we apply the same notion for a non-homogeneous Poisson process then we may have several plausible paths between event rate at$t_1$and event rate at$t_2$. The path refers to the variable rate that may be assumed at will. Thus this brings us to objective of finding the best fit for a non-homogeneous Poisson process that will estimate the variable rate between two major observations. This engenders the requirement of sampling the random variable. The technique adopted is as follows: 1. First samples from the average of distribution is taken. 2. Later the time is adjusted by stretching there by accounting an accurate depiction of various time involved. The sampling is done according to a uniform distribution. The time stretching exploits the fact that sum of Poisson random variable is again a Poisson random variable. This makes a sound basis for asserting the samples on a path between two observations. This helps to understand the inhomogeneous in terms of homogeneous Poisson distribution. However, we have to mathematically understand in terms of$h$function. Assuming Y(t) is inhomogeneous Poisson process where$\lambda(t) = (1-t)\cdot h_0 + t\cdot h_1$where$t \in (0,1)$and assuming$X(s)$is a homogeneous Poisson process with rate$(h_0 + h_1) / 2$here$s$is a state in the state space. On solving using cumulative hazard function:$M(s) = -\log(T_X > s) = s\cdot (h_0+h_1)/2$and ,$\delta(t) = h_0 \cdot t + t^2 (h_1-h_0)/2$on inspection$\delta(t) = M(s) \Rightarrow t = (\sqrt{h_0^2 + (h_1^2 – h_0^2).s} – h_0)/(h_1 – h_0)$Thus we can infer the following:$X(\sqrt{h_0^2 + (h_1^2 – h_0^2).s} – h_0)/(h_1 – h_0))$has distribution of$Y(t)$The above proof shows the complete observation, complete path based on the two stochastic connected processes. This also helped to find probability path based on our sampling methods. Based on an accept reject filter we may be able to filter the correct probability path. This forms the basis for next sample. Thus we are able to accept a sample for the event rates between two different observations within a time period. ### Metropolis-Hastings Sampling from Discrete Time Observations of Finite State Continuous Time Markov Process Sample from finite state continuous time Markov process and discrete observations: • $X_t\in{1,\dots,k}$ • State transition kernel $q_{i,j}=lim_{t\to 0} {(P(X_{t+\Delta t}=j|X_t=i)-1(i\equiv j)\over \Delta t}$ and initial state $X_0$ • Observation model $P(Y_t|X_t)$ • Observations $Y_0, Y_1, \dots, Y_T$ Plan: • Initialise sample path to be consistent with the observed data. Repeat the following • Sample rate constants from their full conditionals given the current sample path. • For each of the T intervals, propose a new sample path consistent with its end-points and accept/reject it with a Metropolis-Hastings step • Output the current rate constants. Sample between Observations Assuming Homogeneous Event Rates Now we begin to introduce the conception of Homogeneous Poisson process, the basic form of Poisson process is: Counting process $\{N(t),t\ge0\}$ is defined as Poisson process with rate$\lambda$, if: •$N(0)=0$• Independent increments (the numbers of occurrences counted in disjoint intervals are independent of each other) • for any$s,t\ge0$,$P\{N(t+s)-N(s)=n\} =
{(\lambda t)^n e^{-\lambda t}\over n!}, n=0,1,2, …$Consider a homogeneous Poisson process with rate $\lambda$ on the finite interval $[0,T]$. Then the number of events happening in this interval has Poisson distribution, $R\sim{\rm Poisson}(\lambda T)$. In addition, the event times conditioned on $R$ have uniform distribution $U(0,T)$. For any given $\lambda,p>0$, let $N\sim{\rm Poisson}(\lambda)$ and $X|N\sim{\rm Binomial}(N,p)$. Then $X\sim{\rm Poisson}(\lambda p)$$P(X=k) = \sum_{i=k}^\infty P(X=k|N=i)P(N=i)\\
= \sum_{i=k}^\infty {i!\over k! (i-k)!} p^k {(1-p)}^{i-k}\times {\lambda^i e^{-\lambda}\over i!}\\
= \dots = {{(\lambda p)}^ke^{-\lambda p}\over k!}$The following graph shows that Poisson process’s sample path:$T_{n}, n=1,2,3,…$is the nth event’s happening time, let$T_{0}=0$.$X_{n}, n=1,2,3,…$is the time interval between (n-1)th event and nth event. Consider a homogeneous Poisson process with rate $\lambda$ on the finite interval $[0,T]$ Then the number of events happening in this interval has Poisson distribution, $R\sim{\rm Poisson}(\lambda T)$. In addition, the event times conditioned on $R$ have uniform distribution $U(0,T)$. Given $N(t)=n$, the $n$ jumps $T_1,\cdots,T_n$ are uniformly distributed on interval $(0,t)$. First, considering situation when n=1, which means event only happens 1 time in time interval [0,t], for$s\le t$,$\mbox{P}(T_{1}\le s|N(t)=1)=\frac{\mbox{P}(N(s)=1)\mbox{P}(N(t)-N(s)=0)}{\mbox{P}(N(t)=1)}=\frac{\lambda se^{-\lambda s}e^{-\lambda (t-s)}}{\lambda te^{-\lambda t}}=\frac{s}{t}$Which indicates that event happens uniformly for n=1, now we consider situation for n$\ge$2:$\mbox{P}(t_{i}<T_{i}\le t_{i}+h_{i},i=1,\cdots,n|N(t)=n)\\=\frac{\mbox{P}(\mbox{exactly 1 event in }(t_{i},t_{i}+h_{i}],i=1,\cdots,n,\mbox{ no events elsewhere in }(0,t))}{\mbox{P}(N(t)=n)}\\= \frac{\lambda h_{1}\exp(-\lambda h_{1})\cdots\lambda h_{n}\exp\left(-\lambda h_{n}\right)\cdot\exp\left(-\lambda(t-h_{1}-\cdots-h_{n}\right)}{\exp\left(-\lambda t\right)(\lambda t)^{n}/n!}\\
= \frac{n!}{t^{n}}h_{1}\cdot h_{2}\cdot\dots\cdot h_{n}.$Taking derivative over intervals $h_1,\cdots,h_n$, we get the desired results. # Factorial Hidden Markov Model [09/29] Prepared by Alireza Farasat, Simranjot Singh, Devesh Mulmule. The Hidden Markov model (HMM) is one of the basic statistical tools for modeling discrete time series. In an HMM, information about the past is conveyed through a single discrete variable—the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner. An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable—the hidden state—which can take on one of K discrete values. This multinomial assumption supports an efficient parameter estimation algorithm—the Baum-Welch algorithm—which considers each of the K settings of the hidden state at each time step. However, the multinomial assumption also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need K = 230 distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary state variables.In the following section we define the probabilistic model for factorial HMMs. ### Factorial Hidden Markov Models — Generative Description We begin by describing the Hidden Markov model, in which a sequence of observations$\left\{Y_t\right\}$where t = 1,…T, is modeled by specifying a probabilistic relation between the observations and a sequence of hidden states$\left\{X_t\right\}$, and a Markov transition structure linking the hidden states. The model assumes two sets of conditional independence relations: • that$Y_t$is independent of all other observations and states given$X_t$, • and that$X_t$is independent of$x_1…x_t−2$given$S_t – 1$(the first-order Markov property). Factorial HMM looks like like presented in the above. Each x in the figure is a latent variable, is represented by a chain indexed by time. So we see that individual chances for each x represented with a superscript evolve as a Markov Chain. Each x could take discreet states like success, failure etc. At each time stamp we have observation that summarizes over the x over all chains. The goals is to make inference about individual chance of given the state transition matrix over all x and the observations. X is a hence a matrix of latent states where columns indexed by chains and rows indexed by time. Y is the observer. Each Markov chain evolve independent of each other with each x taking one of the discreet states, d. Observation Y is hence a Gaussian distribution around linear combination of all X. Therefore the total number of states that X can take is$d^M$(M: number of Markov chains). Using these independence relations, the joint probability for the sequence of states and observations can be factored as. P(\left\{X_t, Y_t\right\}) = P(X_1)P(Y_1|X_1)\prod_{t=2}^{T}P(X_t|X_t−1)P(Y_t|X_t) In factorial HMMs the underlying state transitions are constrained. A natural structure to consider is one in which each state variable evolves according to its own dynamics, and is a priori uncoupled from the other state variables: P(\left\{X_t| X_{t-1}\right\})= \prod_{m=1}^{M}P(X^{(m)}_t |X^{(m)}_{t−1}) In a factorial HMM the observation at time step t can depend on all the state variables at that time step. As mentioned before for continuous observations, one simple form for this dependence is linear Gaussian; that is, the observation$Y_t$is a Gaussian random vector whose mean is a linear function of the state variables. We represent the state variables as K × 1 vectors, where each of the K discrete values corresponds to a 1 in one position and 0 elsewhere. The resulting probability density for a$D × 1$observation vector$Y_t$is $$\begin{eqnarray*} &&P(Y_t| X_t)= |C|^{−1/2}(2\pi)^{−D/2}exp \left\{\frac{-1}{2}(Y_t − µ_t)’C^{−1} (Y_t − µ_t)\right\},\\ \mbox{where} && \mu_t = \sum_{m=1}^{M}W^{(m)}X^{(m)} \end{eqnarray*}$$ Each$W^{(m)}$matrix is a$D\times K$matrix whose columns are the contributions to the means for each of the settings of$X^{(m)}_t$, C is the D × D covariance matrix. The joint log likelihood for the sequence of states and observations hence can be represented as \begin{eqnarray*} && \log P(\left\{X_t, Y_t\right\}) = -\frac{1}{2}\sum_{t=1}^{T}log P(Y_t|X_t) + \sum_{t=1}^{T}\sum_{m=1}^{M}log P(X^{(m)}_t|X^{(m)}_{t-1}) −\frac{T}{2}log|C| -\frac{d.T}{2}log(2\pi),\\ \mbox{where} && \log P(Y_t|X_t) = Y^T_tC^{−1}Y_t – 2\sum_{m=1}^{M}Y^T_tC^{−1}W^{(m)}X^m_t + \sum_{m=1}^{M}\sum_{n=1}^{M}tr(W^{(m)T}C^{-1}W^{(n)}X^{(n)}_tX^{(m)T}_t),\\ \mbox{and} && \log P(X^{(m)}_t|X^{(m)}_{t-1}) = logP^{(m)}X^{(m)}_tX^{(m)T}_{t-1} \end{eqnarray*} where$tr(J)$represents the trace of matrix J. Trace Basics Let A be$n \times n$matrix. Then its trace is given by$Tr(A) = \sum_i A_{ii}$. Some properties/lemma of Trace: •$Tr(A+B) = Tr(A) + Tr(B)$•$Tr(c.A) = c.Tr(A)$•$Tr(A^{T}) = Tr(A)$—diagonal elements are same in$A^{T}$•$Tr(A.B) = Tr(B.A)$$$\begin{eqnarray*} \mbox{Tr}(A.B) & = & a_{11}.b_{11}+a_{12}.b_{21}+\cdots+a_{1n}.b_{n1}\\ & & +a_{21}.b_{12}+a_{22}.b_{22}+\cdots+a_{2n}.b_{n2}\\ & & +\cdots\\ & & +a_{n1}.b_{1n}+a_{n2}.b_{2n}…+a_{nn}.b_{nn}\\ \mbox{Tr}(B.A) & = & b_{11}.a_{11}+b_{12}.a_{21}…+b_{1n}.a_{n1}\\ & & +\cdots\\ & & +b_{n1}.a_{1n}+b_{n2}.a_{2n}…+b_{nn}.a_{nn} \end{eqnarray*}$$ Hence,$Tr(A.B) = Tr(B.A)$•$Tr(X^{T}Y) = Tr((X^{T}Y)^{T}) = Tr(Y^{T}X)$—since$Tr(A^{T}) = Tr(A)$•$Tr(XY^{T}) = Tr((XY^{T})^{T}) = Tr(YX^{T})$•$Tr(X^{T}Y) = Tr(YX^{T}) = Tr(Y^{T}X) = Tr(XY^{T}) = \sum_{ij}X_{ij}Y_{ij}$, which is the dot product of two matrices. • Jacobi’s formula for determinant of matrix is:$\frac {d|A|}{dA^{-1}} = A$### Exact Inference of Factorial Hidden Markov Models Doing exact inference is on Factorial HMM is pretty intractable, especially with a large number of chains where the number of states start exploding. If we do not observe Y in HMM then the chains of X evolve independent of each other. However if we make an observation, then the chains can be seen related as well. This can be explained by the Berkison Paradox — If we restrict ourselves to the cases where events A or B occur where least one of the events A or B occurs, the knowledge that B has occurred makes it less likely that A has occurred or$P(A | B, A or B) < P(A | A or B)$where$0 < P(A), P(B) < 1$(which means that the events have to be interesting enough that$A/B$do not occur with certainty or never). Mathematically, it follows that, given$X^{(m)}$and$X^{(n)}$for$m≠n$are independent, then$X^{(m)}_t$and$X{(n)}_t$for$m\neq n$become dependent given observation$Y^{(m)}_t$about them. Hence exact inference involves enumerating all$(X^{(1)},X^{(2)},…,X^{(M)})$— become computationally intractable when M is large. Suppose a university the admission committee recruits brainy or sporty candidates. A student can either be brainy or sporty or both or neither, however the probability of either is independent. Now, we see that the college normally recruits brainy students and not sporty students. It follows that given that a student is brainy, he is less likely to be sporty. This provides that although the chances X in Factorial Hidden Markov Model are uncorrelated (i.e they evolve independently), but after we perform observation over all the states then the chances can be seen as correlated. Hence we have to estimate joint state of all of these chances together (condition and observation) which leads to an exploding number of states. Say each chain can take 2 states and we have 50 chains then the resultant number here will be$2^{50}$. ### Naive Mean Field Approximation From the exact inference (and Berkson paradox) on factorial Hidden Markov Model, if we do not observe$Y_t$(at time slice t), then the latent states(chains at t)$X_t^{1}, X_t^{2}…. X_t^{M}$will evolve independent of each other. But if$Y_t$is available, then chains are correlated; and hence the number of states will explode. As an example, if a state can take upto K values and M chains are present, then the factorial HMM would require$K^M$states (huge space!), given that observation is available. So, when$Y_t$is available, then for two latent states$X_t^{m}$,$X_t^{n}$we can say,$P(X_t^{m}, X_t^{n}/Y_t) \neq P(X_t^{m}|Y_t) \cdot P(X_t^{n}|Y_t) $. Since, we cannot cope with this joint probability distribution, we need a simpler distribution. The naive mean field approximation is given as distribution Q which approximates the actual distribution of$X_t^{m}$’s with parameters$\theta_t$for each of the chains m. We can express this approximation as:$Q(X_t/\theta) = \prod_t \prod_m Q(X_t^{m}/\theta^{m})$. In other words, we assume that under$Q$the distribution of$X_t^{m}$with parameter$\theta_t^{m}$,is independent of other$X_t^{n}$where$m \neq n$and for each chain m,$X_t^{m}$is independent of$Y_t$. By this assumption, we get rid of all the dependencies in factorial HMM between the chains and ii)between states and observation. Thus, the complexity of the model is highly reduced. This distribution Q belongs to the exponential family; Given that a state can take up to$K$values, then for a chain m, this approximate distribution can be written as:$Q(X_t^{m}/\theta) = \prod_k (\theta_{t,k}^{m})^{X_{t,k}^{m}} = exp(X_t^{(m)T}. log\theta_t^{m})$, where$\sum_k \theta_{t.k}^{m} = 1$. Here, the$X_t^{(m)T}$is a unit row vector$(X_{t,1}^{(m)}, X_{t,2}^{(m)}, .. X_{t,K}^{(m)})$where only one value would be 1 and the rest would be zero, since at time t$X_t^{m}$can exist in only of the K states. Thus, at the end, we will only have one$log\theta_{t,k}^{m}$value. The ultimate goal is to determine the parameters$\theta_t^{m}$for a chain m at time t by minimizing the KL divergence. For computing this KL divergence, some statistics for this distribution Q are defined as follows: • Expected value of Q :$E_Q X_t^{m} = \theta_t^{m}$• Variance of$X_{t-1}^{m}$and$X_{t}^{m}$:$E_Q X_{t-1}^{m} X_{t}^{(m)T} = \theta_{t-1}^{m}\theta_t^{(m)T}$• Co-Variance of$X_{t-1}^{m}$and$X_{t}^{m}$: •$E_Q X_t^{m} X_t^{(n)T} = \theta_t^{m} \theta_t^{(n)T}$—for$m \neq n$•$= diag(\theta_t^{m})$—-for$m=n$(where diag() is an operator that takes a vector and returns a square matrix with the elements of the vector along its diagonal, and zeros everywhere else.) Now we know that log likelihood of actual distribution P is given as: \begin{eqnarray*} \log P(X_t,Y_t) &=& -1/2 \sum_{t=1}^T (Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{m} X_t^{m}\\ &&+\sum_{m=1}^{M}\sum_{n=1}^{M}tr(W^{(m)T}C^{−1}W^{n}X_t^{n}X_t^{(m)T}))\\ &&+\sum_{t=1}^{T} \sum_{m=1}^{M} tr(\log P^{m} X_t^{m} X_{t-1}^{(m)T})-(T/2) \log |C|-(d.T/2) \log (2.\pi) \end{eqnarray*} The KL divergence is the amount of information lost by expressing the actual distribution ‘P’ as the approximate distribution ‘Q’. In terms of entropy it can defined as:$KL(Q||P) = E_Q \log (Q/P)= E_Q \log Q -E_Q \log P$. From the definition of$Q(X_t^{m}|\theta)$,$E_Q \log Q = \sum_t \sum_m \theta_t^{(m)T} \log \theta_t^{m}$. From the variance and covariance statistics of$E_Q$from equation (2) and (3), we can write$E_Q log P$as: $E_Q log P = -1/2 \sum_{t=1}^T (Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{m} \theta_t^{m} +\sum_{m!=n} tr(W^{(m)T}C^{−1}W^{n}\theta_t^{n}theta_t^{(m)T})$$+\sum_{n=1}^{M} tr(W^{(m)T}C^{−1}W^{m} diag (theta_t^{(m)T})))$$+\sum_{t=1}^{T} \sum_{m=1}^{M} tr(log P^{m} \theta_t^{m} \theta_{t-1}^{(m)T})-(T/2) log |C|-(d.T/2) log (2.\pi)$ Here we split out the second term in$E_Q log P$which is $\sum_{m=1}^{M}\sum_{n=1}^{M}tr(W^{(m)T}C^{−1}W^{n}X_t^{n}X_t^{(m)T})$ into two terms considering the co-variance statistics of distribution ‘Q’.\\ Thus we have, \begin{eqnarray*} KL(Q||P) &=& \sum_t \sum_m \theta_t^{(m)T} \log \theta_t^{m} +1/2 \sum_{t=1}^T (Y_t^{T} C^{-1} Y_t -2 \sum_{m=1}^M Y_t^{T} C^{-1} W^{m} \theta_t^{m}\\ && +\sum_{m\ne n}tr(W^{(m)T}C^{-1}W^{n}\theta_{t}^{n}\theta_{t}^{(m)T})+\sum_{n=1}^{M}tr(W^{(m)T}C^{-1}W^{m}diag(\theta^{m}_{t}))\\ && -\sum_{t=1}^{T} \sum_{m=1}^{M} tr(log P^{m} \theta_t^{m} \theta_{t-1}^{(m)T})+(T/2) log |C|+(d.T/2) log (2.\pi) \end{eqnarray*} Differentiating the above KL divergence w.r.t$\theta_t^{m}$and equating to 0 we get,$\log \theta_t^{m}-W^{(m)T}C^{-1}Y_t +\sum_n W^{(m)T}C^{-1}W^{n}\theta_t^{n} +1/2 diag(W^{(m)T}C^{-1}W^{m})-log P^{m}\theta_{t+1}^{m}-log P^{(m)T}\theta_{t-1}^{m}
+const = 0$Hence,$\theta_t^{m} \propto \exp (W^{(m)T}C^{-1}Y_t -\sum_n W^{(m)T}C^{-1}W^{n}\theta_t^{n} -1/2 diag(W^{(m)T}C^{-1}W^{m})+log P^{m}\theta_{t+1}^{m}+\log P^{(m)T}\theta_{t-1}^{m})$Then we use fixed point algorithm to get the values of$\theta$, wherein we iterate over the values of$\theta_t^{m}$until they converge. ### Structured Approximation Mean field approximation presented in the previous section factors the posterior probability such that all the state variables are statistically independent. In contrast to this rather extreme factorization there exists a more realistic approximation that is both tractable and preserves much of the probabilistic structure of the original system. In this scheme the factorial HMM is approximated by M uncoupled HMMs as shown in Figure 2. #### Maximizing Log-Likelihood of Variational Distribution As described in previous chapters, efficient and exact inference is implemented using the forward–backward algorithm within each HMM. Note that the mean field approximation presented in the previous section does not specify the form of distribution. Hence, a structural approximation should be employed using more structural variational distribution$Q$. It is expected that$Q$provides a lower bound on the log likelihood for obtaining a learning algorithm. The only dependency assumed in this method is the dependency of$X_t^{(m)}$on$X_{t-}^{(m)}$and the rest of random variables are assumed to be independent. Therefore, the structured variational approximation is defined as follows:$Q(\{X_{t}\}|\theta)=\frac{1}{Z_{Q}}\prod_{m=1}^{M}\prod_{t=1}^{T}Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta),$where$Z_Q$is a normalization constant guaranteeing that$Q$is a valid probability distribution. Probability of$X_t^{(m)}$given parameter$\theta$and$X_{t-}^{(m)}$is defined as:$Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)=\prod_{k=1}^{K}\left(h_{t,k}^{(m)}\sum_{j=1}^{K}X_{t-1,j}^{(m)}\cdot P_{j,k}^{(m)}\right)^{X_{t,k}^{(m)}}=\exp\left(\left(\log\vec{h}_{t}^{(m)}+\vec{X}_{t-1}^{(m)}\cdot\log P^{(m)}\right)^{\mbox{T}}\cdot\vec{X}_{t}^{(m)}\right),$where$\vec{h}_{t}^{(m)} = \{h_{t,1}^{(m)}, h_{t,2}^{(m)},\ldots, h_{t,K}^{(m)}\}$is a vector indicating the set of parameters corresponding to$\vec{X}_{t}^{(m)}$(note that$h_{t,k}^{(m)}$is the parameter related to$X_{t-1,k}^{(m)}$recalling that$X_t^{(m)}$can take on$K$values). In addition,$P_{j,k}^{(m)}$is the transition probabilities of$X_t^{(m)}$from state$j$to state$k$. It is easy to show that$Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)$follows an exponential family distribution (please see \cite{bishop2006pattern} chapter 2). To characterize$Q(X_{t}^{(m)}|X_{t-1}^{(m)},\theta)$, lets define the following moments (expected values of$X_{t}^{(m)}$,$X_{t-1}^{(m)}X_{t}^{(m){\rm T}}$and$X_{t}^{(m)}X_{t}^{(n){\rm T}}$:$\mathbf{E}_{Q}X_{t}^{(m)}=\gamma_{t}^{(m)}$–$\mathbf{E}_{Q}X_{t-1}^{(m)}X_{t}^{(m){\rm T}}=\xi_{t-1}^{(m)}$–$\mathbf{E}_{Q}X_{t}^{(m)}X_{t}^{(n){\rm T}}=\begin{cases}
\gamma_{t}^{(m)}\gamma_{t}^{(n)\mbox{T}} & m\ne n\\
\mbox{diag}\left(\gamma_{t}^{(m)}\right) & m=n
\end{cases}$By factorizing$Q(\{X_{t}\}|\theta)$, one can rewrite the log likelihood of the variational distribution as follows:$-\log Q(\{X_{t}\}|\theta)=-\sum_{t}\sum_{m}\log\vec{h}_{t}^{(m)\mbox{T}}\cdot\vec{X}_{t}^{(m)}-\sum_{t}\sum_{m}\vec{X}_{t-1}^{(m)\mbox{T}}\cdot\log P_{j,k}^{(m)}\cdot\vec{X}_{t}^{(m)}$Note that maximizing log-likelihood function is equivalent to minimizing -1$\times$log-likelihood function. #### Minimizing KL-Divergence of Variational Distribution Maximizing log-likelihood function is equivalent to minimizing KL-divergence. Assuming$P=P(\{X_{t}\}|\theta)$and$Q=Q(\{X_{t}\}|\theta)$represents true and variational distribution of$X_{t}$. KL-divergence is defined as follows: \begin{eqnarray} &&D_{KL}(Q\|P)=\mathbf{E}_{Q}\log Q-\mathbf{E}_{Q}\log P=\sum_{t}\sum_{m}\log\vec{h}_{t}^{(m)\mbox{T}}\cdot\vec{X}_{t}^{(m)}\\ &&+\frac{1}{2}\sum_{t=1}^{T}\left(Y_{t}^{\mbox{T}}C^{-1}Y_{t}-2\sum_{m=1}^{M}Y_{t}^{\mbox{T}}C^{-1}W^{(m)}\gamma_{t}^{m}+\sum_{m\ne n}\mbox{tr}\left(W^{(m)\mbox{T}}C^{-1}W^{(n)}\gamma_{t}^{(n)}\gamma_{t}^{(m)\mbox{T}}\right)+\sum_{m=1}^{M}\mbox{tr}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\mbox{diag}\left(\gamma_{t}^{(m)}\right)\right)\right)\\ &&-\log Z_Q+{T\over 2}\log |C| + {d\cdot T\over 2}\log (2\pi)\\ \end{eqnarray} In order to minimize the KL-divergence, one need to take derivative respect to the parameters$h_{t}^{(n)}$. This also can be using EM algorithm. In this case, it is more convenient to take derivative respect to$\log h_{t}^{(n)}$(Note that the solution to the KL-divergence remains the same. This can be verified using chain rule). \begin{eqnarray} &&\frac{\partial D_{KL}(Q\|P)}{\partial\log h_{\tau}^{(n)}}=\gamma_{\tau}^{(n)}+\sum_{t=1}^{T}\sum_{m=1}^{M}\underbrace{\left(\log h_{t}^{(m)}-W^{(m)\mbox{T}}C^{-1}Y_{t}+\sum_{l\ne m}W^{(m)\mbox{T}}C^{-1}W^{(l)}\gamma_{t}^{(l)}+\frac{1}{2}\mbox{diag}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\right)\right)}_{\frac{\partial}{\partial\gamma_{t}^{(m)}}}\frac{\partial\gamma_{t}^{(m)}}{\partial\log h_{\tau}^{(n)}}\\ \end{eqnarray} Solving the obtained equations results in the exponential family form of$\log h_{t}^{(n)}$. \begin{eqnarray} &&\Rightarrow\log h_{t}^{(m)}\propto\exp\left(W^{(m)\mbox{T}}C^{-1}Y_{t}-\sum_{l\ne m}W^{(m)\mbox{T}}C^{-1}W^{(l)}\gamma_{t}^{(l)}-\frac{1}{2}\mbox{diag}\left(W^{(m)\mbox{T}}C^{-1}W^{(m)}\right)\right) \end{eqnarray} Indeed, The parameter$h_{t}^{(m)}$obtained from these fixed point equations is the observation probability associated with state variable$X_{t}^{(m)}$in hidden Markov model$m$. Using these probabilities, the forward–backward algorithm is used to compute a new set of expectations for$X_{t}^{(m)}$.\\ The next step is maximization after finding the expected value of latent factors given the parameters. Indeed, minimizing KL-divergence to estimate the latent state includes finding out the expected value of latent states followed by, given the observations what values of parameters maximize the likelihood. How can one estimate the parameters? To estimate the parameters, one need to use maximum likelihood or maximum expected likelihood approach. In this case there exist two sets of parameters$W$and$C^{-1}$(Note that both$W$and$C^{-1}$are matrices). as mentioned before,$W$is used to calculate the expected value of$Y_t|X_t$. If we take derivative respect to matrix$W$and set the equation equals to zero, the result is: \begin{eqnarray} &&\frac{\partial P(\{X_{t},Y_{t}\})}{\partial W^{(m)}}=\sum_{t}\sum_{m}C^{-1}Y_{t}X_{t}^{(m)\mbox{T}}+C^{-1}W^{(n)}X_{t}^{(n)}X_{t}^{(m)\mbox{T}}\stackrel{\mbox{set}}{=}0\\&\Rightarrow& W^{(n)}=\left(\sum_{t,m}Y_{t}X_{t}^{(m)\mbox{T}}\right)\left(\sum_{t,m}X_{t}^{(n)}X_{t}^{(m)\mbox{T}}\right)^{-1}=\left(\sum_{t}Y_tX_t^{\mbox{T}}\right)\left(\sum_{t}X_tX_t^{\mbox{T}}\right)\\ &&\mbox{where }X_{t}=\left(\begin{array}{c} X_{t}^{(1)}\\ \vdots\\ X_{t}^{(M)} \end{array}\right) \end{eqnarray} Similarly, precision matrix of the observation is estimated as follows: \begin{eqnarray} &\frac{\partial P(\{X_{t},Y_{t}\})}{\partial P^{(m)}}=\sum_{t=1}^{T}X_{t-1}^{(m)}X_{t}^{(m)\mbox{T}}\\\Rightarrow&P_{i,j}^{(m)}=\frac{\sum_{t}X_{t-1,i}^{(m)}X_{t,j}^{(m)}}{\sum_{t}X_{t-1,i}^{(m)}} \end{eqnarray} # Expectation Propagation [10/01] Prepared by Ashish Gupta, Utsav Patel and Santhosh Somarapu. Wen Dong will add the math of the lecture into this note or will turn the math into a problem in problem set 3/4. Belief Propagation Belief propagation, also known as sum-product message passing is a message passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Suppose we have this graphical model shown above. We can convert this graphical model into a cluster graph / component graph / clique tree shown below by removing the directed edges as directed edges don’t matter when we apply the forward backward algorithm for inference . The different components are shown in the the diagram below. Cluster Label : A,B,C,D in order A more simplified version of the component graph is shown below. Each cluster has jurisdiction over some variables. For example, cluster A has jurisdiction over variable x1 , cluster B has jurisdictions over variables x1 and x2, cluster C has jurisdiction over variables x2 and x3 and so on. Now these components are going to talk to each other and try to convince each other about the state of the variable which they each have jurisdiction on in common. So cluster A is going to talk to cluster B about the about the state of the variable x1 and try to make cluster B more informed about x1 in order to get a better distribution over variable B. All these clusters are going to talk to each other via message passing. Each cluster sends forward all the incoming message times its current belief to the next cluster as the message and this goes on until convergence, until the variable distributions don’t change much. Approximate Belief Propagation Let alpha denote the forward messages and beta denote the backward messages . Let both alpha and beta belong to exponential distributions. Given statistics, which is the expected value of those features under the distribution, we can find out those alpha. Given those alpha, the parameters, we can find out the expected value of those features. There’s a nice duality between the set of features and the parameters in the exponential family. Exponential family is denoted by a set of features. Assuming exponential distribution for alpha(t-1) , we can propagate the information to alpha(t) and using this we can estimate the features for our exponential distribution and because of the duality between features and parameters for exponential distribution , we can estimate the parameters.Similarly we can propagate from beta(t) to beta(t-1) and estimate the features and the parameters. Stochastic Kinetic Models Stochastic kinetic models (SKMs) are a highly multivariate jump process used to model chemical reaction networks, particularly in biochemical and cellular systems. An example of such system would be: A mixer of some molecules of different chemical species or reactants that are interacting and producing diffrerent chemical products. There can be many possible products generated from all processes and we want to know the possible composition of the mixer after some time. [$\bullet$] we have a combination of M chemicals interacting in a system, we can specify the initial combination of the chemicals (at time t=0) as $x_0 = \{x_0^{(1)}, x_0^{(2)}, … x_0^{(M)}\}$ where$x_0^{(m)}$is the initial quantity of$m^{th}$chemical. [$\bullet$] At time t, the combination is $x_t = \{x_t^{(1)}, x_t^{(2)}, … x_t^{(M)}\}$ [$\bullet$]Among these chemicals, there can be many possible reactions that can occur and change the state of the chemical composition. E.g. $A + C + B \rightarrow D + E$ $C+ E\rightarrow B+A$ $A + E \rightarrow F$ [$\bullet$] We can assume that there are total V reactions and each reaction’s rate is dependent on the quantity of the reactant chemicals at time t ($x_t$) and a reaction constant$C_k$. If the process contains just one reactant, the rate$h_k(x_k, c_k) \propto c_k * x_k$. If there are more than one reactants then the rate$h_k(x_k, c_k) \propto c_k \prod_{i=1}^{M} x_k$. Also each reaction k, changes the composition by$\Delta_k$. If the initial system is converted from $\alpha^{(1)}_k*X^{(1)}+ … +\alpha^{(1)}_k*X^{(M)} \rightarrow \beta^{(1)}_k*X^{(1)} + … +\beta ^{(1)}_k*X^{(M)}$ • Then, $\Delta_k = (\beta^{(1)}_k – \alpha^{(1)}_k, … ,\beta^{(M)}_k – \alpha^{(M)}_k)$ • So we want to estimate the state$x_t$of the chemicals probabilistically using stochastic simulation. Gillespie’s Algorithm Implementation [$\bullet$] Given the initial state of the composition$x_0 = \{x_0^{(1)}, x_0^{(2)}, … x_0^{(M)}\}$, evaluate propensity function(or rate) for all reactions k that can happen at that time:$h_k(x_k, c_k)$. [$\bullet$] From all possible reactions, choose one reaction randomly according to their weightage of rate ($\frac{h_k}{h_0}$) and the time$\tau$untill the next reaction occurs. [$\bullet$] Update the time and available composition of the mixer according to the used reactants and products.$t\leftarrow t+\tau, x\leftarrow x-\Delta_k$[$\bullet$] Repeat step 2 and 3 as long as there are possible reactions that can occur from available composition. Or untill you arrive at the desired time to observe the system. [$\bullet$] Repeating the process several times gives us additional information of how system behaves in intermediate states and what are the possible composition states that occurs frequently. [$\bullet$] In this manner we can simulate the whole chemical chain reaction process as a stochastic model. [$\bullet$] The same algorithm can be applied to many other examples such as: given an initial state of predators and prays in an ecological system, what are the possible numbers of prays and predators that will thrive or survive in the system after some time. [$\bullet$] Mathematically, the probability of a process k occurs with a rate$h(x_k, c_k)$and after a time interval$\tau$is be given by $P(k, \tau, x) = h(x_k, c_k)* exp(-(h(x_0, c_0)\tau)))$ [$\bullet\$] Hence the joint probbility of the system over the whole time period is
$= \prod_{i=0}^n \Big[ h_{v_i}(x_{t_i}, c_{v_i})* exp(-h_0(x_{t_i}, c) * (t_{i+1} – t_i))\Big]$
$= \Big[\prod_{i=0}^n h_{v_i}(x_{t_i}, c_{v_i})\Big]* exp(\Sigma_{i=0}^n \{-h_0(x_{t_i}, c) * (t_{i+1} – t_i)\} )$

Stochastic Kinetic Model — Discretization
In order to make an inference about the observation of a system, and if we want to combine the observations with the dynamics, we’ll have to discretize the events, which is what a lot of people normally do in solving differential equations. Instead of sampling, we just increment the time step by one by one-time unit. We will not sample events.
Here, instead of sampling the next event, we first adding a new event, which means nothing has changed. We’re sampling whether it is a specific event happening, or whether it is a new event happening. Those are the two changes we have made in order to discretize this system.
We have applied the same idea previously in either throwing a coin repeated, or first estimate the time to the next successful, to the next failure. This is just the same. We estimate the time to the next success or next fail, the next weather change directly, change in the weather. All we just do, it’s a step-by-step, step-by-step, and say that’s time Y that hasn’t changed. Time 2 it hasn’t changed until sometime t, the weather changes. It’s the same idea. We can do it either way.
Based on this discretized, discrete-event model. Discrete event meaning they’re only finite at any given amount of time, there are only a finite number of events which have effect, changes of the system that is greater than something. This is always true. There are two different types of changes. One type of change is that in any small moment, we have infinite number of changes. Another change is discrete-event system.
If we count the number of events which are greater than some threshold, there will only be a finite number of events over any finite amount of time. Now we have discrete time, discrete-event model.

Projection — Moment Matching

Here we are coping with approximate messages passing. This means that we will first propagate forward step or in the backward step, we will first have some forward message alpha. Then we will propagate this forward message alpha over this system, so through this \psi, propagate the alpha through this \psi.

After we propagate over the psi, we will find out the sufficient statistics, and we’ll use the sufficient statistics to write new message in exponential form, and so on and so forth. Then Enumerate, all combinatorial states of those states, is pretty implausible. What we do is the following. The system looks like the following. Let’s write three components here.
t-1, t, t+1 and X1,X2,X3… and so on

After once I’ve sampled this event, then we will change the state those latent variables. You want to change the state of the different variables, we need to know the event, and we need to know the previous value of this latent variable. After we know the event, then we know the previous value. We know definitely how it to change this one. If we think of it in terms of inference, we need to enumerate all states, which is not possible. What we actually do is this. We will find the mean effect which is what we want to find.

Here, for example, I know that I’m currently the susceptible, and I have been in contact with 2 random people. What is the probability of that I will be infected? If I’m infected, then it will change my state according to this event. This is how this system works. In this we’ll have to enumerate the combinatorial states.

Here in this system, what we do is here I’m susceptible. Based on my average contact with the other persons, what is my probability of getting infected? This probability does not depend on the specific values of the events, this probability depends on which specific person I am in contact with. Here is what is the, for example, average amount of time I contact with this person? In this way, we decouple the relations of different events As a result, we will be able to make inference that is computationally traceable.

Projection — Parameter Learning

After finding the mean we need to estimate the parameters since in the system there are lot of noise observations and we don’t know the parameters so we need to work on the parameters from the noise observations. To find that first, we write down the probability, which is a factor as the probability after each time steps. After we write down the likelihood in X and Y and then log likelihood, we can write down the expected log likelihood. Of course, in order to write down the expected log likelihood, we need to know the probability distribution of the latent states. Since we do not know, so we just use our previous estimation of the probability distribution of the latent states what we will do is to take the partial derivative of the expected log likelihood over the parameters, which results in taking the partial derivative of this log of P over the parameter.