# Introduction [09/01/2015]

### What is Simulation Modeling and Why do We Need it?

In this course, we’re going to talk about simulation modeling. What is simulation modeling? We have a world. We do a lot of things to make this world a better world: we build roads, we build cities, we fight diseases, and so on. Those are expensive human endeavors, because the activities may consume many man-months and millions of dollars, and they may have substantial consequences.

How can we make such activities as efficient and error-free as we can? We do simulation modeling: first we find out the basic patterns of the real world. Then we construct a virtual world that duplicates the real world, following the same set of rules. Then we simulate this virtual world, conduct thought experiments, and reason about how we can improve it. After we finally find a way to improve the virtual world, out of many trials and errors, we can proceed to implement this way in the real world. Why do we think a strategy that improves the virtual world will also improve the real world? The reason is that we have set up the virtual world to run under the same set of rules as the real world. This one-to-one map between the virtual world and the real world is our guarantee that whatever happens in the virtual world will be followed by the real world and vice versa. This type of reasoning doesn’t always work, because we are human beings and we make mistakes, but it is what we do: we reflect on our past, plan our future, and conduct thought experiments at all times.

Simulation modeling has been widely applied by researchers from different disciplines to assist their scientific reasoning. We computer scientists need to understand the applications of simulation modeling to design algorithms and help researchers from other disciplines in their thought experiments. In systems biology, simulation modeling is applied to model the stochasticity of the chemical reactions happening at the cell level. According to Wilkinson, “modeling is an attempt to describe in a precise way an understanding of the elements of a system of interest, their states, and their interactions with other elements”, and a model “should be sufficiently detailed”. Joshua Epstein, a social scientist working on computational epidemics and other dynamical systems, conducts research on generative social science by “situating an initial population of autonomous heterogeneous agents in a relevant spatial environment, allowing them to interact according to simple local rules, and thereby generating, or ‘growing’, the macroscopic regularity from the bottom up”. His approach is derived from the approach of solid-state and statistical physicists. The idea is that we can make macroscopic observations about our system, such as temperature, pressure, and volume, and we want to understand what is going on underneath to produce that temperature, pressure, and volume. Our approach is simulation modeling: we try to construct the macroscopic measures of our system from the microscopic interactions of the atoms. If we can construct the macroscopic phenomena from the microscopic interactions of the atoms, we have a hypothesis about the dynamics of this system.

Why do we need models? According to Wilkinson, we need models because we want to check our understanding of a particular system, conduct thought experiments, and use those thought experiments to inform our design and analysis. According to Epstein, with models we can check whether “a hypothesized micro-specification suffices to generate the observed phenomenon”. According to Page in his online course Model Thinking, models can help us engage in conversations, weed out logical inconsistencies, engage and understand data, and decide, strategize, and design our ways to work with the world.

How does simulation modeling work, and how can computer scientists make better simulation-modeling tools? Let us conduct some thought experiments.

### Simplified Weather System

The first model is a simplified weather system. In this weather system, we have sunny days and rainy days. If today is a sunny day, tomorrow is more likely to be a sunny day because the weather tends to stay in a state for a while. Similarly, if today is a rainy day, tomorrow is more likely to be a rainy day. Stretches of sunny days tend to last longer, and stretches of rainy days tend to be shorter. Let us, for the sake of reasoning, specify that if today is a sunny day ($X_t=\mbox{sunny}$), then with 90% probability tomorrow is going to be a sunny day ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})=0.9$) and with 10% probability tomorrow is going to be a rainy day ($P(X_{t+1}=\mbox{rainy}|X_t=\mbox{sunny})=0.1$); if today is a rainy day ($X_t=\mbox{rainy}$), then with 80% probability tomorrow is going to be a rainy day ($P(X_{t+1}=\mbox{rainy}|X_t=\mbox{rainy})=0.8$), and with 20% probability tomorrow is going to be a sunny day ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{rainy})=0.2$). In short, the transition probability from sunny to rainy is 10%, and from rainy to sunny is 20%.
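This two-state chain can be sampled directly on a computer. Here is a minimal sketch in Python (the course tool is R, but the idea is identical), using exactly the transition probabilities specified above:

```python
import random

# Transition probabilities from the text:
# P(sunny -> sunny) = 0.9, P(sunny -> rainy) = 0.1
# P(rainy -> rainy) = 0.8, P(rainy -> sunny) = 0.2
P = {"sunny": {"sunny": 0.9, "rainy": 0.1},
     "rainy": {"sunny": 0.2, "rainy": 0.8}}

def next_day(today, rng=random):
    """Sample tomorrow's weather given today's weather."""
    return "sunny" if rng.random() < P[today]["sunny"] else "rainy"

def sample_path(start, n_days, rng=random):
    """Sample the weather for n_days after the given start day."""
    path = [start]
    for _ in range(n_days):
        path.append(next_day(path[-1], rng))
    return path

random.seed(0)
print(sample_path("sunny", 6))
```

Each call to `next_day` is one throw of the biased coin described below; a sample path is just a sequence of such throws.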

Given this simple model and given that today is a sunny/rainy day, we can simulate/sample this weather system: given that today is a sunny day, we can throw a biased coin that yields a head with probability 0.9 and a tail with probability 0.1, and we note a sunny day for tomorrow’s weather if we get a head and a rainy day if we get a tail. Given that today is a rainy day, we can throw another biased coin that yields a head with probability 0.2 and a tail with probability 0.8, and again note a sunny day if we get a head and a rainy day if we get a tail. In this way we can get one sample path “sunny, sunny, sunny”, and another sample path “sunny, sunny, rainy”. Corresponding to each sample path, we have a probability: if today is a sunny day, then the probability that tomorrow is a sunny day is .9, and conditioned on tomorrow being a sunny day, the probability that the day after tomorrow is also a sunny day is again .9. Hence the sequence “sunny, sunny, sunny” is associated with the probability $P(X_{t+1}=\mbox{sunny},X_{t+2}=\mbox{sunny}|X_{t}=\mbox{sunny})=.9\times .9=.81$. Similarly, the sequence “sunny, sunny, rainy” is associated with the probability $P(X_{t+1}=\mbox{sunny},X_{t+2}=\mbox{rainy}|X_{t}=\mbox{sunny})=.9\times .1=.09$.
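The path probabilities worked out above can be computed mechanically by multiplying one-step transition probabilities. A small Python sketch:

```python
# Transition probabilities from the text.
P = {"sunny": {"sunny": 0.9, "rainy": 0.1},
     "rainy": {"sunny": 0.2, "rainy": 0.8}}

def path_probability(path):
    """Probability of a sample path, conditioned on its first day."""
    prob = 1.0
    for today, tomorrow in zip(path, path[1:]):
        prob *= P[today][tomorrow]
    return prob

print(path_probability(["sunny", "sunny", "sunny"]))  # .9 x .9 = .81
print(path_probability(["sunny", "sunny", "rainy"]))  # .9 x .1 = .09
```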

So we have defined a way to sample this system, and we have associated randomness with the different sample paths of this system. In this sense we have defined a generative model, which we can use to generate different sample paths with different probabilities.

Now suppose that I, in Buffalo, chat with my girlfriend in San Francisco every day. While she doesn’t tell me the daily weather in San Francisco, she tells me whether she spends the day cleaning her apartment or hiking around. On a sunny day, she is more likely to hike around, and on a rainy day she is more likely to clean her apartment. Let us make some inferences about San Francisco’s weather from my girlfriend’s daily activities.

We will use the forward-backward algorithm, which is based on Bayes’ theorem. Suppose that today is a sunny day in San Francisco, and tomorrow my girlfriend tells me that she is cleaning her apartment all day. What is the probability that tomorrow is a sunny day, and what is the probability that tomorrow is a rainy day? Let $X_t$ represent the latent weather in San Francisco (which I do not know) at day $t$, and let $Y_t$ represent the observation of my girlfriend’s activity at day $t$.

Here is our reasoning: conditioned on today being sunny, the probability that tomorrow is a sunny day is 90% ($P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})=0.9$) and the probability that tomorrow is a rainy day is 10%. Conditioned on tomorrow being a sunny day, my girlfriend cleans her apartment with 10% probability ($P(Y_{t+1}=\mbox{cleaning apartment}|X_{t+1}=\mbox{sunny})=0.1$) and walks outside with 90% probability ($P(Y_{t+1}=\mbox{walking}|X_{t+1}=\mbox{sunny})=0.9$). Conditioned on tomorrow being a rainy day, my girlfriend cleans her apartment with 90% probability and walks outside with 10% probability. My girlfriend tells me that she cleans her apartment tomorrow; she could clean her apartment on a sunny day, and she could also clean her apartment on a rainy day. The probability that tomorrow is a sunny day and my girlfriend cleans her apartment, conditioned on today being sunny, is $P(X_{t+1}=\mbox{sunny}|X_t=\mbox{sunny})\times P(Y_{t+1}=\mbox{cleaning apartment}|X_{t+1}=\mbox{sunny}) =.9 \times .1 = .09$, and the probability that tomorrow is a rainy day and my girlfriend cleans her apartment, conditioned on today being sunny, is $P(X_{t+1}=\mbox{rainy}|X_t=\mbox{sunny})\times P(Y_{t+1}=\mbox{cleaning apartment}|X_{t+1}=\mbox{rainy}) = .1\times .9 = .09$. The total probability that my girlfriend cleans her apartment is $.09+.09=.18$, and conditioned on my girlfriend cleaning her apartment tomorrow, the probability that tomorrow is a sunny day is ${.09\over .09+.09}=50\%$, and the probability that tomorrow is a rainy day is also $50\%$. This is the kind of reasoning that we engage in to work with a stochastic process.
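The arithmetic above is one step of Bayesian filtering, and it can be checked mechanically. A Python sketch, using the transition and activity (emission) probabilities assumed above:

```python
# One step of Bayesian filtering: given that today is sunny and that
# tomorrow my girlfriend cleans her apartment, what is the probability
# that tomorrow is sunny?
trans = {"sunny": {"sunny": 0.9, "rainy": 0.1}}       # P(X_{t+1} | X_t = sunny)
emit = {"sunny": {"cleaning": 0.1, "walking": 0.9},   # P(Y_{t+1} | X_{t+1})
        "rainy": {"cleaning": 0.9, "walking": 0.1}}

# Joint probability of each tomorrow-state together with the observation.
joint = {state: trans["sunny"][state] * emit[state]["cleaning"]
         for state in ("sunny", "rainy")}
total = sum(joint.values())                  # P(Y_{t+1} = cleaning) = .18
posterior = {state: p / total for state, p in joint.items()}
print(posterior)                             # both states close to 0.5
```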

Now suppose my girlfriend and I have chatted about her daily activities for many, many days. We have a sequence of activities such as “walking, walking, cleaning”. Instead of asking the probability that a specific day is sunny or rainy given the daily activities, we ask: what is the most likely sequence of weather states corresponding to the sequence of activities?

The challenge in answering this question is that the size of our search space increases exponentially with the number of days in the sequence: conditioned on today’s weather, tomorrow’s weather is one of two states, the weather over tomorrow and the day after tomorrow takes one of $2 \times 2=4$ states, the weather over the next three days takes one of $2 \times 2 \times 2=8$ states, and so on. In order to find the most likely path efficiently, we will use a technique known as dynamic programming. If we know the most likely weather sequence from day 1 to day $t$ ending at each state $X_t$ (sunny or rainy), then the most likely sequence from day 1 to day $t+1$ ending at state $X_{t+1}$ is the one with the largest probability among those passing through some $X_t$ to $X_{t+1}$. In this way, we can compress the space of our search, and thus turn a search over an exponential number of paths into one over a polynomial number of states. That is the idea of the Viterbi algorithm.
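A minimal Viterbi sketch in Python for this two-state hidden Markov model, assuming the transition and activity (emission) probabilities used above, plus a uniform prior over the first day’s weather (an assumption, since the text does not specify one):

```python
trans = {"sunny": {"sunny": 0.9, "rainy": 0.1},
         "rainy": {"sunny": 0.2, "rainy": 0.8}}
emit = {"sunny": {"cleaning": 0.1, "walking": 0.9},
        "rainy": {"cleaning": 0.9, "walking": 0.1}}
init = {"sunny": 0.5, "rainy": 0.5}  # assumed uniform prior over day 1

def viterbi(observations):
    """Most likely weather sequence given the observed activities."""
    states = ("sunny", "rainy")
    # best[s] = (probability, path) of the best path ending in state s.
    best = {s: (init[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        # Extend each state's best path by one day, keeping for every
        # ending state only the highest-probability predecessor.
        best = {s: max(((best[r][0] * trans[r][s] * emit[s][obs],
                         best[r][1] + [s]) for r in states),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["walking", "walking", "cleaning"]))
```

Keeping only the best path into each state at each day is exactly the compression described above: two candidates per day instead of $2^t$.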

Now suppose we have estimated the probability distributions of daily weather in San Francisco from my girlfriend’s daily activities, or we have estimated the most likely sequence. Can we estimate the parameters of our stochastic process model: the likelihood that tomorrow is a sunny day or a rainy day, conditioned on today being a sunny day or a rainy day? Asked alternatively, what is the average duration of a stretch of sunny days before the weather changes, or the average duration of a stretch of rainy days before the weather changes? What is the event rate of a weather change from a sunny day to a rainy day, or from a rainy day to a sunny day? To answer this question, we inspect $n$ stretches of sunny days that are $N$ days in total. Because the probability of a weather change from a sunny day to a rainy day is 0.1, we expect ${n\over N}=0.1$, which means the average duration of a stretch of sunny days is ${N\over n}={1\over .1}=10$ days. Similarly, we estimate that the average stretch of rainy days is ${1\over .2}=5$ days. A stretch of sunny days lasts longer because the probability of a weather change from sunny to rainy is smaller. What we are talking about here are the relationships among different random variables: one with an exponential distribution, one with a geometric distribution, one with a Poisson distribution, and one with a binomial distribution. We will talk about those concepts later.
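We can check the 10-day and 5-day averages by simulation. In this Python sketch, a stretch ends each day with the change probability, so its length is geometrically distributed:

```python
import random

# A sunny stretch ends each day with probability 0.1, so its length is
# geometric with mean 1/0.1 = 10 days; a rainy stretch ends each day
# with probability 0.2, so its mean length is 1/0.2 = 5 days.
random.seed(1)

def stretch_length(p_change, rng=random):
    """Sample the length of a stretch before the weather changes."""
    days = 1
    while rng.random() >= p_change:
        days += 1
    return days

n = 50_000
avg_sunny = sum(stretch_length(0.1) for _ in range(n)) / n
avg_rainy = sum(stretch_length(0.2) for _ in range(n)) / n
print(avg_sunny, avg_rainy)  # close to 10 and 5 respectively
```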

### Predator-Prey Dynamics

The simplified weather system has very simple dynamics, and reality is much more complicated. The predator-prey model, although still pretty simple, has found applications in ecology and many other areas. In this model we have two populations: the predator population and the prey population. The system evolves according to three events or “chemical reactions”: First, a prey individual in the system can multiply into two prey individuals. The rate (probability per unit time) for this event to happen to a specific prey individual is $c_1$. (The rate for the whole population is $c_1\times$ the prey population.) Second, whenever a prey individual and a predator individual meet, the predator turns the prey into a predator. This event removes a prey individual from the system and introduces a predator into the system. The rate for this event is $c_2$. Third, a predator individual can die. This event removes a predator individual from the system. The rate for this event is $c_3$.

This type of system exists in reality, and people often observe that the prey population and the predator population change periodically, with the predator population following the prey population. We ask: why do the prey population and the predator population move periodically, how do we preserve the ecosystem, when is the ecosystem near collapse and with what probability, and so on and so forth. With the dynamics of this system, we can proceed to reason about different ways to introduce interventions and change the system.

We have defined a generative model of a dynamical system driven by three events; that is, we have defined a way to sample this system, just as we defined a way to sample the simplified weather system: at time zero, we have our initial state, which comprises the prey population and the predator population. Then we sample the next event and change the predator and prey populations according to that event. Then we sample the next event again, change the state of the system again, and so on and so forth. Given a sequence of events, we know the sequence of states of the system in terms of the number of prey and the number of predators. Associated with each sequence of events, we also have a probability, which is the probability for this sequence of events to happen. In this sense, this predator-prey model is mathematically quite similar to the weather system above. This is a model developed by a chemist.

We can define the dynamics of this system in terms of the Gillespie algorithm, which generates a description, a probability measure, of a stochastic process. We can proceed to sample this system. Then we often need to infer the evolution of the system from imperfect observations about it. Then we can proceed to reason about this system and different interventions, and engage data with simulation modeling. Our observations of the predator-prey system are imperfect because we cannot continuously track the populations, and often we can track only one of the two populations. Our reasoning about a predator-prey system could include: whether the system is in danger of collapse, whether there is going to be a peak or a valley of population, and, if we want to change the dynamics of the system in this or that way, what we should do.
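A Python sketch of the Gillespie algorithm for the three predator-prey events; the rate constants `c1`, `c2`, `c3` and the initial populations are illustrative values I chose, not taken from the text:

```python
import math
import random

def gillespie(prey, pred, c1=1.0, c2=0.005, c3=0.6,
              t_end=10.0, max_events=100_000, rng=random):
    """Sample one trajectory of the predator-prey system."""
    t, history = 0.0, [(0.0, prey, pred)]
    while t < t_end and len(history) <= max_events:
        # Propensities of the three events at the current state.
        a = [c1 * prey, c2 * prey * pred, c3 * pred]
        a_total = sum(a)
        if a_total == 0:
            break  # both populations are extinct
        # Time to the next event is exponential with rate a_total.
        t += -math.log(rng.random()) / a_total
        # Pick which event fires, proportionally to its propensity.
        u = rng.random() * a_total
        if u < a[0]:
            prey += 1            # a prey individual reproduces
        elif u < a[0] + a[1]:
            prey -= 1            # a predator converts a prey ...
            pred += 1            # ... into a predator
        else:
            pred -= 1            # a predator dies
        history.append((t, prey, pred))
    return history

random.seed(2)
hist = gillespie(prey=100, pred=50)
print(len(hist), hist[-1])
```

Each iteration samples when the next event happens and which event it is; the recorded history is one sample path of the process, with its own probability, just as in the weather system.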

### Bales’ Interaction Process Analysis

Many times we cope with data about people, and the dynamics of people are much more complicated. Bales’ Interaction Process Analysis is a model of how groups of people solve problems face to face. Let us consider Computer-Supported Cooperative Work (CSCW for short). In CSCW research, we use sensors to capture images, audio signals, motion signals, and other kinds of signals. From these signals, we are supposed to advise the group on how it can improve its problem-solving performance. In other words, we listen to a group of people speaking an unknown language, watching how they interact because we do not know what they talk about, and in the end we want to give the people advice about how they can solve problems better. This is a tough problem. In order to solve it we will use simulation modeling to help us engage gigabytes of data collected from group discussions and generate reasoning.

Bales’ Interaction Process Analysis is one such model: for a group to solve problems, the group needs to carry out different types of tasks, such as asking for facts and giving facts, asking for new directions to explore and giving new directions, voting on opinions, and resolving disputes. Different tasks involve different features: if I give a fact, I may talk longer while the other people listen; if a group of people vote on a direction, there will be a lot of overlap in their speech signals; if there is a lot of quarreling, the pitch and the volume may be higher, and so on.

Hence, although we are listening to a language we do not understand, with this model we can understand the dynamics, such as whether a person is giving a fact, whether there is agreement or disagreement going on, whether a group of people is voting, and who is on one side and who is on the other. Then at the end of the discussion, although we still do not know its content, we can count, for example, how many facts have been given, how many voting processes took place, whether there was a lot of quarreling, and who was on which side. In the end, we can tabulate the different activities in the discussion, which effectively shrinks the data from perhaps a gigabyte to one page of numbers, and estimating the performance of the group from this one page of numbers will be much more reliable.

In this example, as in the weather system and the predator-prey model, we start by defining a probabilistic model driven by a set of events, proceed to infer the evolution of the system from our observations about the system, and finally reason about the statistics of the system and about introducing interventions. The inference problem is normally much harder, because it involves complex interactions, and we will talk about different ways to solve it.

Let us talk about simulation modeling in transportation engineering. Building a road or setting up tolls is not easy, because if things are implemented in the wrong way, there will be a lot of turmoil in a city. Also, building a road involves billions of dollars. So we have to first carefully reason about our plan through simulation modeling before we can really implement it.

Transportation engineers start simulation modeling by collecting data. They collect the road network, which is similar to the road network that you use in a GPS navigator. They also collect the population density and typical travel patterns, which come from census data and travel surveys. In the census data we have different census tracts, and for each tract we have the number of households, the average incomes of the households, the population distribution, and so on. The transportation department calls different people and asks them questions about their daily travel, such as: “What is the distance from where you live to where you work?”; “How many trips do you take, and what are the types of trips?”. From this kind of data, we can synthesize the typical trips of different people. Now we have the travel network, which supplies the transportation, and we have the travel demand, which is synthesized from the population surveys.

After we have those two types of information, we put the data into a simulator. The simulator simulates how people move out in the morning, how traffic jams form due to the interaction of many cars on the street, and how traffic jams result in detours and increase the costs to commuters in terms of fuel, late arrival at the workplace, and so on. After the simulation, the individual agents in the multi-agent system try to improve their travel plans, just as we improve our trips from day to day in reality. The simulator goes through this process again and again until the system reaches equilibrium. At this point, we know how the average person heads out to work, heads back from work, takes lunch, picks up the kids, and so on.

The evolution of this system is defined by a sequence of events. If $P$ is a person and $L$ is a location, which could be a road or a building, the events are $P L_1 \to P L_2$ with a rate of happening $c$. The events happen at different rates: if the road is long and the traffic is heavy, the rate of moving off the road will be low; if one direction leads to a more popular place than another, there will be a higher probability that people take the first direction. Again, we are able to define the dynamics of people in terms of a stochastic process.

Now suppose that we can observe the trajectories of taxi cabs: whether a taxi cab moves slower than before on a road, whether there are more taxi cabs on a road, and so on. We can engage in the same kinds of reasoning as before. Starting from the number and behavior of tracked vehicles in a road link, we can estimate the total number of vehicles in the link by scaling, and thus estimate the traffic conditions. If we trace the origins and destinations of the estimated vehicles through the behavior of the simulator, filling any gaps with prior individual travel behaviors, we can extract information about the traffic at other road links. If we then iterate the estimation between the traffic at links and the trip choices of simulated vehicles, we improve our estimates of both.

By observing how people move, we can infer not only road traffic but also economic and social indices of large social systems. There are many sectors in our industry: the education sector, the service sector, the energy sector, the banking sector, and so on. People in different sectors move in different ways. Hence, by studying how people move around and interact with each other, we might be able to estimate the GDP of a whole city, the regions of poverty, the distribution of different industrial sectors, and the growth of a city from the core city into a much larger one.

Nowadays data is ample, and the question is how we can engage the data in our modeling. For example, how can we infer the dynamic collaborator network from the DBLP data set, a publication database in the field of computer science? Why did people collaborate in the past, who will collaborate in the future, and what will be the hot topics in a scientific field? From the proximity data about computer science students captured with mobile phones, how do we figure out the typical paths through which colds and flu spread, and how do we organize campaigns to improve health? From the D4D data set about the hourly locations and phone communications of 12 million people in Senegal, how can we make inferences about road traffic, big events at specific places, the spreading of malaria, and so on?

### Summary

The learning objective is that at the end of this semester, you will be able to simulate Markov processes. A Markov process is a process in which, given the present, we do not have to worry about the past to make inferences about the future. You will be able to make inferences about different Markov processes by combining simulation models with data, and you will be able to reason about the dynamics of those processes: What is the average duration of a stretch of sunny days? What will be the first time the prey population drops below some threshold? What will be the performance of a group? What is the probability of a traffic jam at a certain place? Which areas are poorer, and which are richer?

To achieve the learning objectives, we will talk about how to simulate Markov processes, and then how to make inferences about them. We touched on exact inference today, but most of the time the processes are pretty complicated, and we will have to use either MCMC (Markov chain Monte Carlo) sampling or optimization techniques to find approximate solutions. Then we will talk about the properties of Markov processes, and apply our knowledge to understand the different systems that you may encounter in the future. Most of these systems are systems about people. The tool that we will use in this course is R.

We plan to have four problem sets, and the first problem set will be out by the end of September 11th, next Friday. I normally give you two weeks to solve the problems. First, do not worry: I think you will all get above a B+. Second, if you have trouble finishing the problems, discuss them with me. I think hands-on experience is pretty important if we want to understand things, and we will not get it if we do not solve the problems.

At the end of the semester each of you will submit a term paper, which could be a literature review of a field you are interested in related to machine learning and stochastic processes, an exploration within your own projects, or preparation for a paper that you want to publish next.

Participation: in this semester, I will allocate 10% of the score to participation, in the form of contributing lecture notes to the course web log. I will finish the first few, you will do the rest, and I will put all the notes together and share them with the whole class at the end of the semester. Please register on the course web log so that you can contribute lecture notes.

In summary, nowadays we have a lot of data; industry is interested in big data, and academia is interested in big data as well. There are a lot of opportunities for us to understand social systems, chemical systems, biological systems, and so on from these data. The issue is how we can engage those data, understand our systems, and help people. In this course, we will study together the mathematical tools that we might find useful in the future. I will see you in the next class.