Prepared by Ankit Sharma, Karthik Kiran and Mohit Arora

The idea is to combine our theories and models with large data sets, so that we can make inferences about the models and carry out interesting analyses.

Cell phones have become an important platform for understanding social dynamics and influence because of their pervasiveness, sensing capabilities, and computational power. Many applications have emerged in recent years in mobile health, mobile banking, location-based services, media democracy, and social movements. With these new capabilities, we could potentially identify the exact points and times of infection for diseases, determine who most influences us to gain weight or become healthier, know exactly how information flows among employees and how productivity emerges in our work spaces, and understand how rumors spread.

Many of the data sets that track communities are collected from mobile phones, and many applications track user behavior and generate very large data sets. User privacy has to be protected according to the applicable standards.

These data sets are interrelated, forming a network of data sets.

Some of the open source data sets are:

- Nokia Mobile Dataset – Sensors embedded in mobile phones and other wireless devices are used to collect large quantities of continuous data about the behavior of individuals and social networks.
- Reality Commons – Contains the dynamics of several communities of about 100 people each, collected by the MIT Human Dynamics Lab.
- MERL (Mitsubishi Electric Research Laboratories) motion detector data set – Collected with printed circuit boards worn by MERL researchers for nearly a year, giving traces of life in a research laboratory. Anchor nodes were also installed at different corners of a floor. With these, a localization algorithm can estimate where the workers were while carrying out their tasks, whom they interacted with, and similar observations.
- Dartmouth Student Life Dataset – The whole Student Life dataset is in one big file. It contains all the sensor data, EMA (ecological momentary assessment) data, survey responses, and educational data.
- Telecom Italia Big Data Challenge – The dataset provides information about telecommunication activity over different places.
- CRAWDAD – CRAWDAD (Community Resource for Archiving Wireless Data At Dartmouth) contains many such data sets about networks. The data is not necessarily collected with mobile phones; it may come from other sources such as embedded devices or printed circuit boards. The data sets include data reflecting interaction among students, students in dormitories, workers, etc.

In the data sets where mobile phones collect data based on Bluetooth proximity, a phone scans for other phones within range, and a detection is an indication of proximity. Information about Wi-Fi access points within range of the phone is also retrieved in the process. By knowing the location of the Wi-Fi access points, the location of people can be determined. By knowing the building in which a Wi-Fi access point is located, the activity of students can be inferred.
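As a minimal sketch of how such scans might be turned into a time-indexed proximity network, assume each scan is a tuple `(scanner_id, seen_id, timestamp_seconds)`. This record layout, the function name `build_proximity_edges`, and the one-hour binning are assumptions for illustration, not the format of any of the data sets listed above.

```python
from collections import defaultdict

def build_proximity_edges(scans, bin_seconds=3600):
    """Turn Bluetooth scan records into undirected, time-binned edges.

    scans: iterable of (scanner_id, seen_id, timestamp_seconds) tuples
           (a hypothetical record layout).
    Returns a dict mapping time bin -> set of frozenset({a, b}) pairs.
    """
    edges = defaultdict(set)
    for scanner, seen, ts in scans:
        if scanner == seen:
            continue  # ignore self-detections
        t = int(ts // bin_seconds)
        # A frozenset makes (1, 2) and (2, 1) the same undirected edge.
        edges[t].add(frozenset((scanner, seen)))
    return dict(edges)

# Toy scans: persons 1 and 2 see each other in hour 0, 1 sees 3 in hour 1.
scans = [(1, 2, 10), (2, 1, 50), (1, 3, 4000), (3, 3, 4100)]
edges = build_proximity_edges(scans)
```

Treating proximity as symmetric is a design choice: a detection in either direction is taken as evidence that the two people were near each other during that time bin.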

Above is a snapshot of Bluetooth proximity in the data set, taken among a group of 100 people who know one another. People are grouped into multiple clusters.

The second plot shows the locations of students over one week. The y axis is indexed by hour of the week. The x axis is indexed by location, with each location corresponding to a different Wi-Fi access point.

The thick yellow line marks the time when the students are in the dormitory, and the other yellow lines mark the classrooms. This gives an overall picture of the students' behavior and interactions.

The plots above show evidence of diffusion in the network, and relationships among activity, proximity, and friendship can be obtained from them.

For each person, the distribution of visits over the different access points is plotted. Data from many people yields many such distributions, and the correlation between two people's distributions indicates how similar their behavior and other characteristics are.

In the third plot, the ordered correlations of non-friend pairs are plotted against those of friend pairs. When the correlation for friends reaches about 40%, the correlation for non-friends is still only about 20%. Friends therefore have more strongly correlated location visits, so if we know that two people are friends, we can predict the behavior of one from the behavior of the other. In other words, sensor data lets us predict people's behavior through their friendships.
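The correlation of location-visit distributions can be sketched as follows. The access-point names, the toy visit logs, and the helper names `visit_distribution` and `pearson` are all hypothetical, and Pearson correlation is assumed here as the correlation measure.

```python
from collections import Counter
from math import sqrt

def visit_distribution(visits, access_points):
    """Fraction of a person's observations at each access point."""
    counts = Counter(visits)
    total = len(visits)
    return [counts[ap] / total for ap in access_points]

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Toy visit logs (one entry per observed hour); purely illustrative.
aps = ["dorm", "class", "gym", "library"]
alice = visit_distribution(["dorm", "dorm", "class", "gym"], aps)
bob = visit_distribution(["dorm", "dorm", "class", "gym"], aps)
carol = visit_distribution(["library", "library", "library", "class"], aps)

friends_corr = pearson(alice, bob)        # identical habits
non_friends_corr = pearson(alice, carol)  # different habits
```

With real data, one would compute this correlation for every pair and compare the values for friend pairs against those for non-friend pairs.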

In the second plot, the aerobic activity of person A is plotted against that of person B. If A and B are friends, the number of aerobic exercise sessions per week of A can be predicted from the number for B, because the two are correlated. So from this data set, behaviors of friend pairs such as physical activity can also be determined. Friends converge on the same type of behavior.

Another source of information is the cold and flu surveys. Daily reports about cold and flu symptoms are taken from the students. The first plot shows reports of fever and the second shows reports of runny nose (common cold).

A red mark indicates that the student reported fever or cold on the day given on the x axis.

The number of students catching fever is much lower than the number catching a cold. Black marks indicate missing data, for days on which a student did not submit a report.

What we want to do with this data is track the spread of cold and flu at the individual level, using the symptom reports together with the proximity network. The reports cover a small population, and the task is to build models that can be inferred from this data and then applied to a larger population.

To do that, a susceptible-infectious-susceptible (SIS) model can be used.

**Agent-based Susceptible-Infectious-Susceptible model:**

$$\begin{array}{l} S+I\to 2I \mbox{ with rate } \beta\\ I\to S \mbox{ with rate } \gamma \end{array}$$

This model has two events: infection and recovery. Here, S represents a susceptible individual and I represents an infectious individual. The following software packages can be used to simulate epidemic spreading dynamics:

- Influenza Epidemics Simulation Model: This model simulates the spread of epidemic influenza around the globe. The input data includes the number of passengers traveling from one airport to another worldwide, as well as local movement on highways and railroads.
- Spatiotemporal Epidemiological Modeler (STEM): It is developed in Java. It can also simulate the dynamics of epidemics globally, but most of its examples are at the city level.

**SIS dynamics:**

Input:

- $\begin{array}{l} G=(N,E) \mbox{: dynamic network} \\ N=\{1,\dots,\mbox{number of people}\} \mbox{: people} \\ E=\{(n_1,n_2,t):n_1\mbox{ is near }n_2\mbox{ at time }t\} \mbox{: "nearby" relation} \end{array}$
- $\begin{array}{l} \alpha\sim\mbox{beta}(a,b) \mbox{: probability of infection from outside,} \\ \beta\sim\mbox{beta}(a',b') \mbox{: probability of infection from within the network,} \\ \gamma\sim\mbox{beta}(a'',b'') \mbox{: probability of recovery.} \end{array}$

After we have defined the probability for each sample path, we will be able to make inferences.

- $P(Y|X):$ how symptoms are dependent on common cold and flu state.

The output is a pair of matrices, X and Y. X is the latent state and takes one of two values: 0, meaning the person is susceptible, or 1, meaning the person is infected (infectious).

Output:

- A matrix structure $\{X_{n,t},Y_{n,t}:n,t\}$ indexed by time t and node n. Latent state $X_{n,t}$ of node n at time t is either 0 (susceptible) or 1 (infected). Symptom $Y_{n,t}$ is probabilistically dependent on the state $X_{n,t}$.
- A collection of events R drives change in state.

We also have the symptoms, which form a vector indicating whether a person has a fever, has a runny nose, has diarrhea, is stressed, is sad, and so on.

Now we want to understand how this infection spreads. This simulation model works in the following way.

- Sample the parameters from beta distributions: $\alpha\sim\mbox{beta}(a,b)$, $\beta\sim\mbox{beta}(a',b')$, and $\gamma\sim\mbox{beta}(a'',b'')$.
- $X_{n,1}=0,\forall n\in N$ : all people are susceptible at t=1

After we fix those three parameters, we work with a Markov process.

We first set the state of every person at time 1. With the initial states in place, we iterate over time. At each time step, every infectious person recovers with probability gamma. For every edge that connects an infectious person and a susceptible person, the infectious person infects the susceptible person with probability beta. In addition, every susceptible person in the system can be infected by someone outside the system with probability alpha.
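The simulation loop described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the lecture's implementation: people are indexed from 0 instead of 1, edges are treated as undirected, the infection probability is written as alpha plus beta times the number of infectious contacts and clipped to 1, and the prior hyperparameters in the usage lines are hypothetical.

```python
import random

def simulate_sis(n_people, edges_by_time, n_steps, alpha, beta, gamma, rng):
    """Simulate discrete-time SIS states on a dynamic contact network.

    edges_by_time: dict mapping t -> list of (n1, n2) pairs that are near
                   each other at time t (treated as undirected).
    Returns X, a list of n_steps + 1 lists; X[t][n] is 0 (susceptible)
    or 1 (infectious).
    """
    X = [[0] * n_people]  # everyone is susceptible at the first step
    for t in range(n_steps):
        current = X[-1]
        # Count the infectious contacts of each person at time t.
        infectious_contacts = [0] * n_people
        for a, b in edges_by_time.get(t, []):
            infectious_contacts[a] += current[b]
            infectious_contacts[b] += current[a]
        nxt = []
        for n in range(n_people):
            if current[n] == 1:
                # An infectious person recovers with probability gamma.
                nxt.append(0 if rng.random() < gamma else 1)
            else:
                # Infection from outside (alpha) or from contacts (beta each),
                # clipped so that it remains a valid probability.
                p = min(1.0, alpha + beta * infectious_contacts[n])
                nxt.append(1 if rng.random() < p else 0)
        X.append(nxt)
    return X

rng = random.Random(0)
# Hypothetical beta-prior hyperparameters, chosen only for illustration.
alpha = rng.betavariate(1, 50)
beta = rng.betavariate(2, 20)
gamma = rng.betavariate(2, 5)
# A fixed chain 0-1-2-3 of contacts, repeated at every time step.
edges = {t: [(0, 1), (1, 2), (2, 3)] for t in range(30)}
X = simulate_sis(4, edges, 30, alpha, beta, gamma, rng)
```

Passing an explicit `random.Random` instance keeps the run reproducible, which helps when comparing simulated sample paths against observed symptom reports.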

**Probability Measure Defined by Agent-based SIS Model:**

$\begin{eqnarray} && P\left(\{X_{n,t},Y_{n,t}:n,t\},\alpha,\beta,\gamma\right)\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}P(X_{n,t+1}|\{X_{n',t}:n'\},\alpha,\beta,\gamma)\prod_{n,t}P(Y_{n,t}|X_{n,t})\\ &=&P(\alpha)P(\beta)P(\gamma)\prod_{n,t}\gamma^{1_{X_{n,t}=1\wedge X_{n,t+1}=0}}\cdot (1-\gamma)^{1_{X_{n,t}=1\wedge X_{n,t+1}=1}}\cdot \\ &&\left[\alpha+\beta\cdot\sum_{(n',n,t)\in E}X_{n',t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=1}}\cdot \left[1-\alpha-\beta\cdot\sum_{(n',n,t)\in E}X_{n',t}\right]^{1_{X_{n,t}=0\wedge X_{n,t+1}=0}}\cdot\\ &&\prod_{n,t}P(Y_{n,t}|X_{n,t}) \end{eqnarray}$

We can observe the four different types of events here:

1. A person is infectious at time t and susceptible at time t+1. This is the recovery event, with probability $\gamma$.

2. A person who is infectious at time t is still infectious at time t+1. The probability is consequently $1-\gamma$.

3. A person is in state 0 at time t and in state 1 at time t+1. This is the infection event. Its probability, $\alpha+\beta\sum_{(n',n,t)\in E}X_{n',t}$, combines infection from outside the system with infection from infectious contacts within the system.

4. A person is in state 0 at time t and still in state 0 at time t+1, that is, not infected. The probability is one minus the total infection probability in item 3.
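To make the four event types concrete, the state-transition part of the log-probability of a sample path can be computed as below. This sketch omits the priors and the symptom terms $P(Y|X)$, uses 0-based indices and undirected edges, and assumes every transition probability is strictly between 0 and 1. The function name is hypothetical.

```python
from math import log

def transition_log_likelihood(X, edges_by_time, alpha, beta, gamma):
    """Log-probability of the state transitions in X given the parameters.

    X[t][n] is 0 or 1; edges_by_time maps t -> list of undirected (n1, n2)
    pairs. Priors and the symptom terms P(Y|X) are omitted.
    """
    total = 0.0
    for t in range(len(X) - 1):
        cur, nxt = X[t], X[t + 1]
        contacts = [0] * len(cur)
        for a, b in edges_by_time.get(t, []):
            contacts[a] += cur[b]
            contacts[b] += cur[a]
        for n in range(len(cur)):
            if cur[n] == 1:
                # Events 1 and 2: recovery, or remaining infectious.
                p = gamma if nxt[n] == 0 else 1.0 - gamma
            else:
                # Events 3 and 4: infection, or remaining susceptible.
                p_inf = alpha + beta * contacts[n]
                p = p_inf if nxt[n] == 1 else 1.0 - p_inf
            total += log(p)
    return total

# Two people in contact at both steps: person 0 is infected, then person 1 recovers.
X = [[0, 1], [1, 1], [1, 0]]
edges = {0: [(0, 1)], 1: [(0, 1)]}
ll = transition_log_likelihood(X, edges, alpha=0.1, beta=0.2, gamma=0.3)
```

In this toy path the four transitions have probabilities 0.3 (infection with one infectious contact), 0.7, 0.7 (two non-recoveries), and 0.3 (recovery), so `ll` is the log of their product.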

**Gibbs-Sampling of Agent-based SIS Model:**

$\begin{eqnarray} X_{n,t+1}|\{X,Y\}\backslash X_{n,t+1};\alpha,\beta,\gamma &\sim& \mbox{Bernoulli}\left(\frac{P(X_{n,t+1}=1)}{\sum_{x=0,1}P(X_{n,t+1}=x)}\right)\\ &&\mbox{where }P(X_{n,t+1}=1)=P(X|\alpha,\beta,\gamma)P(Y|X)\\ R_{n,t}&\sim&\mbox{Categorical}\left(\frac{\alpha,\beta,\dots,\beta}{\alpha+\beta\sum\limits _{n'}1_{(n',n)\in E_{t}\cap X_{n',t}=1}}\right)\\ \alpha &\sim& \mbox{Beta}(a+\sum_{n,t}1_{\{R_{n,t}=1\}},b+\sum_{n,t:X_{n,t}=0}1-\sum_{n,t}1_{\{R_{n,t}=1\}})\\ \beta &\sim& \mbox{Beta}(a'+\sum_{n,t}1_{\{R_{n,t}>1\}},b'+\sum_{n,t:X_{n,t}=0;n'}1_{(n',n)\in E_{t}\cap X_{n',t}=1}-\sum_{n,t}1_{\{R_{n,t}>1\}})\\ \gamma &\sim& \mbox{Beta}(a''+\sum_{n,t:X_{n,t}=1}1_{\{X_{n,t+1}=0\}},b''+\sum_{n,t:X_{n,t}=1}1-\sum_{n,t:X_{n,t}=1}1_{\{X_{n,t+1}=0\}}). \end{eqnarray}$

Gibbs sampling works when we know the conditional distribution of each variable given all the other variables. The variables in our system include alpha, beta, and gamma, which are the rates of the events, and X, which is the state of the individuals. We iterate over all persons and all times and sample each X conditioned on everything else. Then we sample alpha, then beta, and then gamma, and then we come back to sample all of the X again. Theoretically, after we have iterated this sampling process for long enough, we obtain the distribution not only of the rates of the system but also of the states of the individuals and their joint states.
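The parameter step of this sampler can be sketched as follows, with the states X held fixed. Each observed infection is first attributed to the outside source or to an infectious contact (the role of $R_{n,t}$), and the three rates are then drawn from their Beta conditionals. The function name, the `priors` dictionary layout, and the 0-based indexing are assumptions, and resampling X itself is not shown.

```python
import random

def sample_parameters(X, edges_by_time, priors, alpha, beta, rng):
    """One Gibbs update of (alpha, beta, gamma) given the states X.

    priors: dict with hypothetical keys 'alpha', 'beta', 'gamma', each a
            (shape1, shape2) pair for the corresponding Beta prior.
    alpha, beta: current values (alpha > 0), used to attribute infections.
    """
    n_outside = n_within = 0         # infections attributed to each source
    n_susceptible = n_exposures = 0  # opportunities for each kind of infection
    n_infectious = n_recovered = 0
    for t in range(len(X) - 1):
        cur, nxt = X[t], X[t + 1]
        contacts = [0] * len(cur)
        for a, b in edges_by_time.get(t, []):
            contacts[a] += cur[b]
            contacts[b] += cur[a]
        for n in range(len(cur)):
            if cur[n] == 1:
                n_infectious += 1
                n_recovered += 1 if nxt[n] == 0 else 0
                continue
            n_susceptible += 1
            n_exposures += contacts[n]
            if nxt[n] == 1:
                # Categorical draw: outside source has weight alpha, each
                # infectious contact has weight beta. Only the outside vs.
                # within distinction matters for the counts below.
                w_out = alpha
                w_in = beta * contacts[n]
                if rng.random() * (w_out + w_in) < w_out:
                    n_outside += 1
                else:
                    n_within += 1
    a1, b1 = priors["alpha"]
    a2, b2 = priors["beta"]
    a3, b3 = priors["gamma"]
    new_alpha = rng.betavariate(a1 + n_outside, b1 + n_susceptible - n_outside)
    new_beta = rng.betavariate(a2 + n_within, b2 + n_exposures - n_within)
    new_gamma = rng.betavariate(a3 + n_recovered, b3 + n_infectious - n_recovered)
    return new_alpha, new_beta, new_gamma

rng = random.Random(0)
X = [[0, 1, 0], [1, 1, 0], [1, 0, 0]]
edges = {0: [(0, 1)], 1: [(0, 1)]}
priors = {"alpha": (1, 1), "beta": (1, 1), "gamma": (1, 1)}
alpha, beta, gamma = sample_parameters(X, edges, priors, 0.1, 0.2, rng)
```

A full sampler would alternate this step with resampling every $X_{n,t}$ from its Bernoulli conditional and would keep the draws after a burn-in period.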

**Variational Inference with Agent-based SIS Model**

The challenge here is that we are coping with a system with a huge number of states. To make the problem tractable we need a variational inference method. The variational inference method that we have used is called expectation propagation.

$\begin{eqnarray} \alpha(X_{n,t+1}=1) &=& \alpha(X_{n,t}=0)\cdot (a+b\sum_{(n',n,t)\in E}\gamma(X_{n',t}))\cdot P(\mbox{obs}|X_{n,t+1}=1)\\ &&+\alpha(X_{n,t}=1)\cdot(1-c)\cdot P(\mbox{obs}|X_{n,t+1}=1)\\ \\ \alpha(X_{n,t+1}=0) &=& \alpha(X_{n,t}=0)\cdot (1-a-b\sum_{(n',n,t)\in E}\gamma(X_{n',t}))\cdot P(\mbox{obs}|X_{n,t+1}=0)\\ &&+\alpha(X_{n,t}=1)\cdot c\cdot P(\mbox{obs}|X_{n,t+1}=0) \end{eqnarray}$

Here $\alpha(\cdot)$ is a forward probability rather than the outside-infection parameter, and $a$, $b$, and $c$ appear in the positions of the outside-infection, within-network infection, and recovery probabilities. The forward probability that person n is infectious (state 1) at time t+1 is the sum of two terms. The first is the probability that this person was susceptible at the previous time step, times the probability that this person is infected between time t and t+1, times the probability of the observation. The second is the probability that this person was infectious at time t, times the probability that this person has not recovered between time t and time t+1, times the probability of the observation.

Similarly, we can iteratively estimate the probability that a person is susceptible at time t+1 from the probabilities that this person is infectious or susceptible at time t.
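One step of this recursion for a single person can be sketched as below. The function names are hypothetical, `neighbor_mass` stands for the sum of the neighbors' marginal probabilities of being infectious, and the normalization helper is an added convenience that is not part of the recursion as written.

```python
def forward_step(msg, neighbor_mass, a, b, c, obs_lik):
    """One step of the forward recursion for a single person.

    msg: (m0, m1), forward values for X_{n,t} = 0 and X_{n,t} = 1.
    neighbor_mass: sum over current contacts n' of their marginal
                   probability of being infectious at time t.
    a, b, c: outside-infection, within-network infection, and recovery
             probabilities, in the positions they occupy in the recursion.
    obs_lik: (P(obs | X_{n,t+1}=0), P(obs | X_{n,t+1}=1)).
    Returns the unnormalized forward values for time t+1.
    """
    m0, m1 = msg
    p_inf = a + b * neighbor_mass
    new1 = (m0 * p_inf + m1 * (1.0 - c)) * obs_lik[1]
    new0 = (m0 * (1.0 - p_inf) + m1 * c) * obs_lik[0]
    return new0, new1

def normalize(msg):
    """Rescale a pair of forward values so that they sum to one."""
    total = msg[0] + msg[1]
    return msg[0] / total, msg[1] / total

# A person known to be susceptible, with expected 1.5 infectious contacts.
uninformative = forward_step((1.0, 0.0), 1.5, 0.1, 0.2, 0.3, (1.0, 1.0))
# Same step, but the reported symptoms are four times likelier if infected.
informed = normalize(forward_step((1.0, 0.0), 1.5, 0.1, 0.2, 0.3, (0.2, 0.8)))
```

Using the neighbors' marginals in place of their exact states is what keeps the update tractable: each person's recursion only needs a single number summarizing the neighbors.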

**Validation of Agent-based SIS Model to Fit Data:**

The next step is validation: making the case that this model, with the dynamics we hypothesized, is applicable to the data. The way to make that case is through hypothesis testing. Here are the three observations:

The first one shows that infection is related to the network. This means that if many of my contacts are infected today, then I will have a higher probability of being infected tomorrow. This is tested with a permutation test, i.e., we permute the symptoms of the individuals within the system, which makes the symptoms independent of the network.
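A permutation test of this kind can be sketched as follows. The test statistic (the number of contacts in which one person is symptomatic at time t and the other at time t+1) and the function names are assumptions chosen for illustration; the lecture does not specify the exact statistic.

```python
import random

def exposure_statistic(symptoms, edges_by_time):
    """Count contacts where one person is symptomatic at t and the other at t+1.

    symptoms[n][t] is 1 if person n reports symptoms at time t, else 0.
    """
    n_times = len(symptoms[0])
    count = 0
    for t, edges in edges_by_time.items():
        if t + 1 >= n_times:
            continue  # no next-day report to compare against
        for a, b in edges:
            count += symptoms[a][t] * symptoms[b][t + 1]
            count += symptoms[b][t] * symptoms[a][t + 1]
    return count

def permutation_test(symptoms, edges_by_time, n_perm, rng):
    """One-sided p-value from shuffling symptom histories across people."""
    observed = exposure_statistic(symptoms, edges_by_time)
    at_least = 0
    shuffled = list(symptoms)
    for _ in range(n_perm):
        # Reassign whole symptom histories to random people; the network
        # stays fixed, so any link between symptoms and contacts is broken.
        rng.shuffle(shuffled)
        if exposure_statistic(shuffled, edges_by_time) >= observed:
            at_least += 1
    # The add-one correction keeps the estimate strictly positive.
    return observed, (at_least + 1) / (n_perm + 1)

# Toy data: symptoms pass along the contact chain 0 -> 1 -> 2.
symptoms = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
edges = {0: [(0, 1)], 1: [(1, 2)]}
observed, p_value = permutation_test(symptoms, edges, 200, random.Random(0))
```

A small p-value would indicate that the observed alignment between contacts and next-day symptoms is unlikely if symptoms were independent of the network.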

The next plot shows, for people who reported certain symptoms, the number of friends with symptoms together with its log probability.

The last plot compares the empirical probability distribution of recovery with the theoretical probability, assuming an exponential distribution. The agreement means that this hypothesis also holds. If it did not hold, the probability of seeing plots like these would be very low.

**Performance of Agent-based SIS Model**

Now we predict symptoms from the dynamic network and from the symptoms of “probes”, and compare our estimates with the “ground truth”.

The black line is the ground truth, the red line is the result of the algorithm, and the blue line is the result of scaling. It turns out that scaling does not work very well. By combining the model with the data set, however, we can estimate the infectious population quite well.

**Application to real world:**

In summary, this lecture discussed a specific example of how we can combine a specific model with a real-world data set to make predictions. Once we are able to do that, we can tweak the network and direct the flow of infection or information in ways that maximize the utility of the network. We can do many other things in the same way. For example, we can combine simulation models of transportation with transportation data sets. We can also take behavioral data about how people move around a city and interact with one another, combine it with a system dynamics model of the city's economic development, and study how to tune people's behavior and interaction to increase the productivity of the city.