Simulation of epidemiological models

11-14 October 2022

Julien Arino (julien.arino@umanitoba.ca)

Department of Mathematics & Data Science Nexus
University of Manitoba*

Canadian Centre for Disease Modelling
Canadian COVID-19 Mathematical Modelling Task Force
NSERC-PHAC EID Modelling Consortium (CANMOD, MfPH, OMNI/RÉUNIS)

* The University of Manitoba campuses are located on original lands of Anishinaabeg, Cree, Oji-Cree, Dakota and Dene peoples, and on the homeland of the Métis Nation.

Outline

  • Foreword: the R language
  • Programming in R
  • Dealing with data
  • Solving ODEs numerically
  • Simulating continuous-time Markov chains

Foreword: the R language

(and other solutions for scientific computing)

R was originally for stats but is now much more

  • Open source version of S
  • Appeared in 1993
  • Now version 4.2
  • One major advantage in my view: uses a lot of C and Fortran code. E.g., deSolve:

The functions provide an interface to the FORTRAN functions 'lsoda', 'lsodar', 'lsode', 'lsodes' of the 'ODEPACK' collection, to the FORTRAN functions 'dvode', 'zvode' and 'daspk' and a C-implementation of solvers of the 'Runge-Kutta' family with fixed or variable time steps

  • Very active community on the web, easy to find solutions (same true of Python, I just prefer R)

Development environments

  • Terminal version, not very friendly
  • Nicer terminal: radian
  • Execute R scripts by using Rscript name_of_script.R. Useful to run code in cron, for instance
  • Use IDEs:
    • RStudio has become the reference
    • RKWard is useful if you are for instance using an ARM processor (Raspberry Pi, some Chromebooks..)
  • Integrate into jupyter notebooks

Going further

  • RStudio server: run RStudio on a Linux server and connect via a web interface
  • Shiny: easily create an interactive web site running R code
  • Shiny server: run Shiny apps on a Linux server
  • Rmarkdown: markdown that incorporates R commands. Useful for generating reports in html or pdf, can make slides as well..
  • RSweave: LaTeX incorporating R commands. Useful for generating reports. Not used as much as Rmarkdown these days

R is a scripted language

  • Interactive
  • Allows you to work in real time
    • Be careful: what is in memory might involve steps not written down in a script
    • If you want to reproduce your steps, it is good to write all the steps down in a script and to test it from time to time by running it with Rscript: this ensures that everything required is indeed loaded into memory when it needs to be, i.e., that the script does not rely on objects that just happen to already be in your interactive session..

There is also Python

  • Python is also scripted
  • I am not a fan: indentation-delimited code blocks are a super bad idea IMBO (I don't know any good programmer who does not indent their code) and I do not enjoy the mix of OO and functional approaches
  • That does not mean it is not a good solution, though
  • Python, even more than R, benefits from a massive user pool, so finding help on the web is usually simple

Jupyter notebooks (and syzygy.ca)

An alternative to using RStudio to develop in R or whatever Python IDE you like is to use Jupyter notebooks, labs or voilà. Jupyter also runs Julia

If you are affiliated with the following Canadian institutions:

Athabasca, BCIT, Brock, ETS Montréal, McGill, McMaster, Queens, SFU, UAlberta, UBC, UCalgary, ULethbridge, UManitoba, UNB, UNBC, UOttawa, UQAM, URegina, USask, USherbrooke, UToronto, UVic, UWaterloo, York

you can use Jupyter notebooks from syzygy.ca, it is free. Note that space and computing power are limited on syzygy.ca, but to test run simple R, Python or Julia code, this is a very good platform

There are other solutions

And of course all the paying stuff (MatLab, Maple, Mathematica), which I do not use as a matter of principle

The "Stack Overflow test"

When choosing a language (if you get to choose), perform the Stack Overflow test: search for the language within square brackets, e.g., [python], [r], etc.

As of 2022-10-08,

  • [python]: 2,034,840 questions
  • [r]: 465,987 questions
  • [julia]: 11,041 questions
  • [octave]: 5,140 questions ([matlab]: 93,540 questions)
  • [sage]: 872 questions

Unless you have a resident expert available, this is good to bear in mind!

Programming in R

Similar to matlab..

.. with some differences, of course! Otherwise, where would the fun be? ;)

Assignment

Two ways:

X <- 10

or

X = 10

First version is preferred by R purists.. I don't really care

Lists

A very useful data structure, quite flexible and versatile. Empty list: L <- list(). Convenient for things like parameters. For instance

L <- list()
L$a <- 10
L$b <- 3
L[["another_name"]] <- "Plouf plouf"
> L[1]
$a
[1] 10
> L[[2]]
[1] 3
> L$a
[1] 10
> L[["b"]]
[1] 3
> L$another_name
[1] "Plouf plouf"

Vectors

x = 1:10
y <- c(x, 12)
> y
 [1]  1  2  3  4  5  6  7  8  9 10 12
z = c("red", "blue")
> z
[1] "red"  "blue"
z = c(z, 1)
> z
[1] "red"  "blue" "1"

Note that in z, since the first two entries are characters, the added entry is coerced to a character as well. Unlike lists, vectors have all entries of the same type

Matrices

Matrix (or vector) of zeros

A <- mat.or.vec(nr = 2, nc = 3)

Matrix with prescribed entries

B <- matrix(c(1,2,3,4), nr = 2, nc = 2)
> B
     [,1] [,2]
[1,]    1    3
[2,]    2    4
C <- matrix(c(1,2,3,4), nr = 2, nc = 2, byrow = TRUE)
> C
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Remark that here and elsewhere, naming the arguments (e.g., nr = 2) allows you to pass them in any order

Matrix operations

Probably the biggest annoyance in R compared to other languages

  • The notation A*B is the Hadamard product (what would be denoted A.*B in matlab), not the standard matrix multiplication
  • Matrix multiplication is written A %*% B
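
For instance, reusing the matrices B and C defined earlier (the expected results are given in the comments):

B * C      # entrywise (Hadamard) product: matrix(c(1,6,6,16), nr = 2)
B %*% C    # matrix product: matrix(c(10,14,14,20), nr = 2)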

Vector operations

Vector addition is also frustrating. Say you write x=1:10, i.e., make the vector

> x
 [1]  1  2  3  4  5  6  7  8  9 10

Then x+1 gives

> x+1
 [1]  2  3  4  5  6  7  8  9 10 11

i.e., adds 1 to all entries in the vector

Beware of this in particular when addressing sets of indices in lists, vectors or matrices
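
For instance, with an illustrative vector (not from the slides), the placement of parentheses changes what the +1 applies to:

x <- c(10, 20, 30, 40, 50)
x[2:4] + 1    # 1 is added to the selected values: 21 31 41
x[2:4 + 1]    # 1 is added to the indices first, i.e., x[3:5]: 30 40 50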

For the matlab-ers here

  • R does not have the keyword end to access the last entry in a matrix/vector/list..
  • Use length (lists or vectors), nchar (character strings), dim (matrices.. careful, it of course returns 2 values)
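
For instance (illustrative objects):

x <- c(10, 20, 30, 40, 50)
x[length(x)]                   # last entry of a vector: 50
tail(x, 1)                     # same thing
M <- matrix(1:6, nr = 2)
M[nrow(M), ncol(M)]            # bottom-right entry: 6
s <- "abcde"
substr(s, nchar(s), nchar(s))  # last character of a string: "e"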

Flow control

if (condition is true) {
  list of stuff to do
}

Even if list of stuff to do is a single instruction, best to use curly braces

if (condition is true) {
  list of stuff to do
} else if (another condition) {
  ...
} else {
  ...
}
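
A trivial but runnable example:

x <- 2
if (x < 1) {
  print("x is less than 1")
} else if (x < 3) {
  print("x is between 1 and 3")
} else {
  print("x is at least 3")
}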

For loops

for applies to lists or vectors

for (i in 1:10) {
  something using integer i
}
for (j in c(1,3,4)) {
  something using integer j
}
for (n in c("truc", "muche", "chose")) {
  something using string n
}
for (m in list("truc", "muche", "chose", 1, 2)) {
  something using string m or integer m, depending
}

lapply

Very useful function (a few others in the same spirit: sapply, vapply, mapply)

Applies a function to each entry in a list/vector/matrix

Because there is a parallel version (parLapply) that we will see later, it is worth learning

l = list()
for (i in 1:10) {
        l[[i]] = runif(i)
}
lapply(X = l, FUN = mean)

or, to make a vector

unlist(lapply(X = l, FUN = mean))

or

sapply(X = l, FUN = mean)

"Advanced" lapply

Can "pick up" nontrivial list entries

l = list()
for (i in 1:10) {
        l[[i]] = list()
        l[[i]]$a = runif(i)
        l[[i]]$b = runif(2*i)
}
sapply(X = l, FUN = function(x) length(x$b))

gives

[1]  2  4  6  8 10 12 14 16 18 20

Just recall: the argument to the function you define is a list entry (l[[1]], l[[2]], etc., here)

Avoid parameter variation loops with expand.grid

# Suppose we want to vary 3 parameters
variations = list(
    p1 = seq(1, 10, length.out = 10),
    p2 = seq(0, 1, length.out = 10),
    p3 = seq(-1, 1, length.out = 10)
)

# Create the list
tmp = expand.grid(variations)
PARAMS = list()
for (i in 1:dim(tmp)[1]) {
    PARAMS[[i]] = list()
    for (k in 1:length(variations)) {
        PARAMS[[i]][[names(variations)[k]]] = tmp[i, k]     
    }
}

There is still a loop, but you can split this list, use it on different machines, etc., and you can use parLapply (more on that later). An example of use is shown below
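
For instance, PARAMS can then be fed to lapply (or later parLapply); one_run below is a hypothetical function standing in for a model evaluation at one parameter set:

one_run <- function(p) {
  # Placeholder computation using the three parameters
  p$p1 * exp(-p$p2) + p$p3
}
results <- unlist(lapply(X = PARAMS, FUN = one_run))
length(results)   # one value per parameter combination: 1000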

Dealing with data

  • Example: population of Canada
  • Data wrangling

It is important to be "data aware"

  • Using R (or Python), it is really easy to grab data from the web, e.g., from Open Data sources
  • More and more locations have an open data policy
  • As a modeller, you do not need to have data everywhere, but you should be aware of the context
  • If you want your work to have an impact, for instance in public health, you cannot be completely disconnected from reality

Data is everywhere

Closed data

  • Often generated by companies, governments or research labs
  • When available, they come with multiple restrictions

Open data

  • Often generated by the same entities but "liberated" after a certain period
  • More and more frequent with governments/public entities
  • Wide variety of licenses, so beware
  • Wide variety of qualities, so beware

Open Data initiatives

Recent movement (5-10 years): governments (local or higher) create portals where data are centralised and published

Data gathering methods

  • By hand
  • Using programs such as Engauge Digitizer or g3data
  • Using APIs
  • Using natural language processing and other web scraping methods
  • Using R or Python packages

Example: population of Canada

library(wbstats)
# make_y_axis() and crop_figure() below are helper functions defined in useful_functions.R
source("useful_functions.R")
pop_data_CTRY <- wb_data(country = "CAN", indicator = "SP.POP.TOTL",
                         mrv = 100, return_wide = FALSE)
y_range = range(pop_data_CTRY$value)
y_axis <- make_y_axis(y_range)
png(file = "pop_CAN.png", 
    width = 800, height = 400)
plot(pop_data_CTRY$date, pop_data_CTRY$value * y_axis$factor,
     xlab = "Year", ylab = "Population", type = "b", lwd = 2,
     yaxt = "n")
axis(2, at = y_axis$ticks, labels = y_axis$labels, las = 1)
dev.off()
crop_figure("pop_CAN.png")

Data wrangling

Data wrangling: dplyr vs sqldf

dplyr is part of the tidyverse set of libraries. Loading it also provides the magrittr pipe %>%

sqldf allows you to use SQL on data frames.. an interesting alternative if you know SQL

library(sqldf)
library(dplyr)
library(EpiCurve)   # provides the EpiCurve() function used below

SARS = read.csv("../DATA/SARS-CoV-1_data.csv")

## Three ways to keep only the data for one country
ctry = "Canada"
# The basic one
idx = which(SARS$country == ctry)
SARS_selected = SARS[idx,]
# The sqldf way
SARS_selected = sqldf(paste0("SELECT * FROM SARS WHERE country = '", 
                             ctry, "'"))
# The dplyr way
SARS_selected = SARS %>%
  filter(country == ctry)
# Create incidence for the selected country. diff computes successive differences,
# so its result has one less entry than the original vector; we therefore pad with a 0.
SARS_selected$incidence = c(0, diff(SARS_selected$totalNumberCases))
# Keep only positive incidences (discard 0 or negative adjustments)
SARS_selected = SARS_selected %>%
  filter(incidence > 0)

# Plot the result
# Before plotting, we need to make the dates column we will use be actual dates..
SARS_selected$toDate = lubridate::ymd(SARS_selected$toDate)
EpiCurve(SARS_selected,
         date = "toDate", period = "day",
         freq = "incidence",
         title = "SARS-CoV-1 incidence in Canada in 2003")

Solving ODEs numerically

  • The deSolve library
  • Example - Fitting data

The deSolve library

The deSolve library

  • As I have already pointed out, deSolve:

The functions provide an interface to the FORTRAN functions 'lsoda', 'lsodar', 'lsode', 'lsodes' of the 'ODEPACK' collection, to the FORTRAN functions 'dvode', 'zvode' and 'daspk' and a C-implementation of solvers of the 'Runge-Kutta' family with fixed or variable time steps

  • So you are benefiting from years and years of experience: ODEPACK is a set of Fortran (77!) solvers developed at Lawrence Livermore National Laboratory (LLNL) starting in the late 70s

  • Other good solvers are also included, those written in C

  • Refer to the package help for details

Using deSolve for simple ODEs

As with most numerical solvers, you need to write a function returning the value of the right hand side of your equation (the vector field) at a given point in phase space, then call this function from the solver

library(deSolve)
rhs_logistic <- function(t, x, p) {
  with(as.list(x), {
    dN <- p$r * N *(1-N/p$K)
    return(list(dN))
  })
}
params = list(r = 0.1, K = 100)
IC = c(N = 50)
times = seq(0, 100, 1)
sol <- ode(IC, times, rhs_logistic, params)

This also works: add p to the arguments of as.list, so parameters can be used without the p$ prefix

library(deSolve)
rhs_logistic <- function(t, x, p) {
  with(as.list(c(x, p)), {
    dN <- r * N *(1-N/K)
    return(list(dN))
  })
}
params = list(r = 0.1, K = 100)
IC = c(N = 50)
times = seq(0, 100, 1)
sol <- ode(IC, times, rhs_logistic, params)

In this case, beware of not having a variable and a parameter with the same name..
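
Either way, ode returns a matrix (of class deSolve) whose first column is time and whose remaining columns are the state variables, so inspecting and plotting the solution is easy:

head(sol)                  # first rows: time and N
plot(sol)                  # deSolve provides a plot method
plot(sol[, "time"], sol[, "N"], type = "l",
     xlab = "Time", ylab = "N(t)")   # or plot manually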

Default method: lsoda

  • lsoda switches automatically between stiff and nonstiff methods

  • You can also specify other methods: "lsode", "lsodes", "lsodar", "vode", "daspk", "euler", "rk4", "ode23", "ode45", "radau", "bdf", "bdf_d", "adams", "impAdams" or "impAdams_d" ,"iteration" (the latter for discrete-time systems)

ode(y, times, func, parms, method = "ode45")
  • You can even implement your own integration method

Example - Fitting data

Example - Fitting data

Principle

  • Data is a set $(t_i, y_i)$, $i = 1, \ldots, n$, where $t_i \in \mathcal{I}$, some interval
  • Solution to the SIR model is $x(t, p)$ for $t \in \mathcal{I}$
  • Suppose parameters of the model are $p$
  • We want to minimise the error function

    $E(p) = \sum_{i=1}^{n} \| x(t_i, p) - y_i \|$

  • Norm is typically Euclidean, but could be different depending on objectives
  • So given a point $p$ in (admissible) parameter space, we compute the solution to the ODE, compute $E(p)$
  • Using some minimisation algorithm, we seek a minimum of $E$ by varying $p$

What are $y_i$ and $x(t_i, p)$ here?

  • In epi data for infectious diseases, we typically have incidence, i.e., number of new cases per unit time
  • In the SIR model, this is $\beta S I$ or $\beta S I / N$, so, if using mass action incidence and the Euclidean norm,

    $E(p) = \sum_{i=1}^{n} \left( \beta S(t_i) I(t_i) - y_i \right)^2$

or, if using standard incidence,

    $E(p) = \sum_{i=1}^{n} \left( \frac{\beta S(t_i) I(t_i)}{N} - y_i \right)^2$

Implementing in practice

See the code SIR_KMK_fitting.R, which we will go over now
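
As a rough preview of the structure (this is a minimal sketch, not the actual file), here is one way to fit beta and gamma of a mass action SIR model to incidence data using deSolve and optim, with a synthetic data set generated on the fly:

library(deSolve)

# Mass action SIR vector field
rhs_SIR <- function(t, x, p) {
  with(as.list(c(x, p)), {
    dS <- -beta * S * I
    dI <- beta * S * I - gamma * I
    dR <- gamma * I
    return(list(c(dS, dI, dR)))
  })
}

# Error function: sum of squared differences between model incidence (beta*S*I)
# and observed incidence. Parameters are passed on the log scale to keep them positive.
error_fct <- function(log_p, data, IC) {
  p <- as.list(exp(log_p))
  sol <- ode(IC, times = data$time, func = rhs_SIR, parms = p)
  model_incidence <- p$beta * sol[, "S"] * sol[, "I"]
  return(sum((model_incidence - data$cases)^2))
}

# Synthetic "observations": simulate the model with known parameters and add noise
true_p <- list(beta = 0.0005, gamma = 0.2)
IC <- c(S = 999, I = 1, R = 0)
times <- 0:60
sol_true <- ode(IC, times, rhs_SIR, true_p)
incidence_data <- data.frame(
  time = times,
  cases = true_p$beta * sol_true[, "S"] * sol_true[, "I"] *
    rnorm(length(times), mean = 1, sd = 0.1)
)

# Minimise the error over (beta, gamma), starting from a rough guess
fit <- optim(par = log(c(beta = 0.001, gamma = 0.1)),
             fn = error_fct, data = incidence_data, IC = IC)
exp(fit$par)   # fitted values, close to true_p if all went well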

Simulating continuous-time Markov chains

(a.k.a. making your ODE wriggly)

  • Why stochasticity matters
  • Continuous time Markov chains
  • ODE to CTMC
  • Simulating CTMC (in theory)
  • Simulating CTMC (in practice)
  • Parallelising your code in R

Why stochasticity matters

Why stochastic systems?

  • Life results from the interaction of countless processes

  • Each process happens with some degree of imprecision. E.g., when a virus replicates, it misses a base here and there; when a cell undergoes mitosis, material is not split 50-50; when you meet another human, you forget to shake their hand. Etc.

  • Because of repetition, this gives rise to uncertainty

  • Stochastic systems incorporate uncertainty

Running example

SIS model without demography

  • Constant total population $N = S + I$
  • Basic reproduction number $\mathcal{R}_0$

In the deterministic world, $\mathcal{R}_0$ rules the world

  • If $\mathcal{R}_0 < 1$, the disease dies out (disease-free equilibrium)
  • If $\mathcal{R}_0 > 1$, it becomes established at an endemic equilibrium

Next slides: simulations with fixed values of $\beta$, $\gamma$, $N$ and $I(0)$

In the stochastic world, make that "$\mathcal{R}_0$ rules-ish" (see the extinction results below)


Extinctions happen quite frequently

In the following slide, we

  • vary the initial number of infectious individuals $I(0)$
  • vary $\mathcal{R}_0$ from 0.5 to 3 by steps of 0.05
  • run 500 simulations for each value pair
  • record the percentage of realisations that see extinctions in the time period under consideration ([0,100] days)

When $\mathcal{R}_0 > 1$, extinctions happen quite frequently

Continuous time Markov chains

In this lecture

  • I will only show one very specific aspect of stochastic systems: continuous-time Markov chains (CTMC), which are essentially the stochastic equivalent of ODE
  • There are plenty of resources on stochastic systems in math epi. For instance, see here some slides and videos I made for a course (the Wednesday April 6 2022 lectures are the ones on stochastic systems)

Continuous time Markov chains (CTMC)

  • Almost exact stochastic equivalent to ODE
  • Conversion from ODE to CTMC and vice-versa is very easy for compartmental models (see later)
  • Harder to study than discrete-time Markov chains (DTMC) but still quite amenable to analysis

Continuous-time Markov chains

CTMC are similar to DTMC except in the way they handle time between events (transitions)

DTMC: transitions occur at each time step $\Delta t$

CTMC: time is continuous and transition times follow an exponential distribution parametrised by the state of the system

CTMC are roughly equivalent to ODE

Several ways to formulate CTMCs

A continuous time Markov chain can be formulated in terms of

  • infinitesimal transition probabilities
  • branching process
  • time to next event

Here, time is in $\mathbb{R}_+$

Example of a simple birth-death process

For small $\Delta t$,

$\mathbb{P}(N(t + \Delta t) = n + 1 \mid N(t) = n) = b_n \Delta t + o(\Delta t)$
$\mathbb{P}(N(t + \Delta t) = n - 1 \mid N(t) = n) = d_n \Delta t + o(\Delta t)$
$\mathbb{P}(N(t + \Delta t) = n \mid N(t) = n) = 1 - (b_n + d_n) \Delta t + o(\Delta t)$

with $o(\Delta t)/\Delta t \to 0$ as $\Delta t \to 0$

ODE to CTMC

Converting your compartmental ODE model to CTMC

Easy :)

  • Compartmental ODE model focuses on flows into and out of compartments
    • ODE model has as many equations as there are compartments
  • Compartmental CTMC model focuses on transitions
    • CTMC model has as many transitions as there are arrows between (or into or out of) compartments

ODE to CTMC: focus on different components

SIS without demography

Transition                 | Effect                     | Weight      | Probability
$S \to S-1$, $I \to I+1$   | new infection              | $\beta SI$  | $\beta SI / (\beta SI + \gamma I)$
$S \to S+1$, $I \to I-1$   | recovery of an infectious  | $\gamma I$  | $\gamma I / (\beta SI + \gamma I)$

States are $(S, I)$, with $S + I = N$ constant

SIS with demography

Transition                 | Effect                     | Weight      | Probability
$S \to S+1$                | birth of a susceptible     | $bN$        | $bN / \xi$
$S \to S-1$                | death of a susceptible     | $dS$        | $dS / \xi$
$S \to S-1$, $I \to I+1$   | new infection              | $\beta SI$  | $\beta SI / \xi$
$I \to I-1$                | death of an infectious     | $dI$        | $dI / \xi$
$S \to S+1$, $I \to I-1$   | recovery of an infectious  | $\gamma I$  | $\gamma I / \xi$

where $\xi = bN + dS + \beta SI + dI + \gamma I$ is the total weight

States are $(S, I) \in \mathbb{N}^2$

Kermack & McKendrick model

Transition                 | Effect                     | Weight      | Probability
$S \to S-1$, $I \to I+1$   | new infection              | $\beta SI$  | $\beta SI / (\beta SI + \gamma I)$
$I \to I-1$, $R \to R+1$   | recovery of an infectious  | $\gamma I$  | $\gamma I / (\beta SI + \gamma I)$

States are $(S, I, R) \in \mathbb{N}^3$

Simulating CTMC (in theory)

Gillespie's algorithm

  • A.k.a. the stochastic simulation algorithm (SSA)
  • Derived in 1976 by Daniel Gillespie
  • Generates possible solutions for CTMC
  • Extremely simple, so worth learning how to implement; there are however packages that you can use (see later)

Gillespie's algorithm

Suppose the system has state $x(t)$ with initial condition $x(t_0) = x_0$ and propensity functions $a_j(x)$, $j = 1, \ldots, M$, of the elementary reactions

set $t \leftarrow t_0$ and $x \leftarrow x_0$
while {$t \le t_f$}

  • Compute the total weight $\xi_t = \sum_{j=1}^{M} a_j(x)$
  • Draw $\tau_t$ from the exponential distribution $\mathcal{E}(\xi_t)$
  • Draw $\zeta_t$ from the uniform distribution $\mathcal{U}([0,1])$
  • Find $r$, smallest integer s.t. $\sum_{j=1}^{r} a_j(x) \ge \zeta_t \xi_t$
  • Effect the next reaction (the one indexed $r$)
  • Set $t \leftarrow t + \tau_t$

Drawing at random from an exponential distribution

If you do not have an exponential distribution random number generator.. We want $\tau \sim \mathcal{E}(\xi)$, i.e., $\tau$ has probability density function

$f(x) = \xi e^{-\xi x}, \quad x \ge 0$

Use the cumulative distribution function

$F(x) = 1 - e^{-\xi x}$

which has values in $[0, 1)$. So draw $\zeta$ from $\mathcal{U}([0,1])$ and solve $\zeta = F(x)$ for $x$, i.e., $x = -\ln(1 - \zeta)/\xi$
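
In R this could look like the following (rexp already does this for you; the point is only to illustrate the inversion):

xi <- 2                             # rate of the exponential distribution
zeta <- runif(1)                    # draw from U([0,1])
tau <- -log(1 - zeta) / xi          # solve zeta = 1 - exp(-xi * tau) for tau
tau_builtin <- rexp(1, rate = xi)   # built-in equivalent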

Gillespie's algorithm (SIS model with only I eq.)

set $t \leftarrow t_0$ and $I \leftarrow I_0$
while {$t \le t_f$}

  • Compute the total weight $\xi_t = \beta I (N - I) + \gamma I$
  • Draw $\tau_t$ from $\mathcal{E}(\xi_t)$
  • Draw $\zeta_t$ from $\mathcal{U}([0,1])$
  • Find $pos$ such that $\zeta_t$ falls in the interval delimited by the cumulative probabilities $\beta I (N - I)/\xi_t$ and $1$
  • switch {$pos$}
    • 1: New infection, $I \leftarrow I + 1$
    • 2: End of infectious period, $I \leftarrow I - 1$
  • Set $t \leftarrow t + \tau_t$
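
A minimal R sketch of this algorithm, assuming mass action incidence; the parameter values are illustrative, not from the slides:

beta <- 0.002; gamma <- 0.5       # illustrative values
N <- 100                          # total (constant) population
I_curr <- 2; t_curr <- 0; t_f <- 100
I <- I_curr; t <- t_curr
while (t_curr <= t_f && I_curr > 0) {
  w <- c(beta * I_curr * (N - I_curr),   # weight of a new infection
         gamma * I_curr)                 # weight of a recovery
  xi_t <- sum(w)
  t_curr <- t_curr + rexp(1, rate = xi_t)
  if (runif(1) < w[1] / xi_t) {
    I_curr <- I_curr + 1          # new infection
  } else {
    I_curr <- I_curr - 1          # end of infectious period
  }
  t <- c(t, t_curr)
  I <- c(I, I_curr)
}
plot(t, I, type = "s", xlab = "Time", ylab = "Number infectious")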

Sometimes Gillespie goes bad

  • Recall that the inter-event time is exponentially distributed
  • Critical step of the Gillespie algorithm:
    • compute $\xi_t$, the total weight of all possible events (propensity)
    • Draw the inter-event time $\tau_t$ from $\mathcal{E}(\xi_t)$
  • So the inter-event time $\tau_t \to 0$ if $\xi_t$ becomes very large for some $t$
  • This can cause the simulation to grind to a halt

Example: a birth and death process

  • Individuals born at per capita rate $b$
  • Individuals die at per capita rate $d$
  • Let's implement this using classic Gillespie

(See simulate_birth_death_CTMC.R on course GitHub repo)

Gillespie's algorithm (birth-death model)

set $t \leftarrow t_0$ and $N \leftarrow N_0$
while {$t \le t_f$}

  • Compute the total weight $\xi_t = (b + d) N$
  • Draw $\tau_t$ from $\mathcal{E}(\xi_t)$
  • Draw $\zeta_t$ from $\mathcal{U}([0,1])$
  • Find $pos$ such that $\zeta_t$ falls in the interval delimited by the cumulative probabilities $bN/\xi_t$ and $1$
  • switch {$pos$}
    • 1: Birth, $N \leftarrow N + 1$
    • 2: Death, $N \leftarrow N - 1$
  • Set $t \leftarrow t + \tau_t$
b = 0.01   # Birth rate
d = 0.01   # Death rate
t_0 = 0    # Initial time
N_0 = 100  # Initial population

# Vectors to store time and state. Initialise with initial condition.
t = t_0
N = N_0

t_f = 1000  # Final time

# We'll track the current time and state (could also just check last entry in t
# and N, but will take more operations)
t_curr = t_0
N_curr = N_0
while (t_curr<=t_f) {
  xi_t = (b+d)*N_curr
  # The exponential number generator does not like a rate of 0 (when the 
  # population crashes), so we check if we need to quit
  if (N_curr == 0) {
    break
  }
  tau_t = rexp(1, rate = xi_t)
  t_curr = t_curr+tau_t
  v = c(b*N_curr, xi_t)/xi_t
  zeta_t = runif(n = 1)
  pos = findInterval(zeta_t, v)+1
  switch(pos,
         { 
           # Birth
           N_curr = N_curr+1
         },
         {
           # Death
           N_curr = N_curr-1
         })
  N = c(N, N_curr)
  t = c(t, t_curr)
}
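
To look at the resulting trajectory:

plot(t, N, type = "s", xlab = "Time", ylab = "Population size")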

Last one did not go well

  • Wanted 1000 time units (days?)
  • Interrupted the run well before that because I lost patience
  • (Penultimate slide: sim stopped because the population went extinct, I did not stop it!)
  • At stop time
    • N = 103,646
    • (and thus $\xi_t = (b+d)N \approx 2073$ as well, of course!)
    • time was moving slowly
> tail(diff(t))
[1] 1.282040e-05 5.386999e-04 5.468540e-04 1.779985e-04 6.737294e-05 2.618084e-04

Simulating CTMC (in practice)

Tau-leaping (and packages) to the rescue!

  • Approximation method (compared to classic Gillespie, which is exact)
  • Roughly: consider "groups" of events instead of individual events
  • Good news: GillespieSSA2 and adaptivetau, two standard packages for SSA in R, implement tau leaping

Simulating a CTMC

library(GillespieSSA2)
# Example values (not from the original slide); adjust as needed
Pop <- 1000; I_0 <- 2; t_f <- 100
gamma <- 1/5; beta <- 1.5 * gamma / Pop   # so that R_0 = beta*Pop/gamma = 1.5
IC <- c(S = (Pop-I_0), I = I_0)
params <- c(gamma = gamma, beta = beta)
reactions <- list(
  reaction("beta*S*I", c(S=-1,I=+1), "new_infection"),
  reaction("gamma*I", c(S=+1,I=-1), "recovery")
)
set.seed(NULL)
sol <- ssa(
    initial_state = IC,
    reactions = reactions,
    params = params,
    method = ssa_exact(),
    final_time = t_f
)
plot(sol$time, sol$state[,"I"], type = "l",
     xlab = "Time (days)", ylab = "Number infectious")

Parallelising your code in R

Parallelisation

To see multiple realisations: good idea to parallelise, then interpolate the results. Write a function, e.g., run_one_sim, that.. runs one simulation (a sketch follows), then..
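
A minimal sketch of what such a function might look like for the SIS CTMC, using GillespieSSA2 (the actual function is in SIS_CTMC_parallel.R on the course GitHub repo); it assumes params is a list with entries Pop, I_0, gamma, beta and t_f:

run_one_sim <- function(params) {
  IC <- c(S = params$Pop - params$I_0, I = params$I_0)
  reactions <- list(
    reaction("beta*S*I", c(S = -1, I = +1), "new_infection"),
    reaction("gamma*I", c(S = +1, I = -1), "recovery")
  )
  sol <- ssa(
    initial_state = IC,
    reactions = reactions,
    params = c(gamma = params$gamma, beta = params$beta),
    method = ssa_exact(),
    final_time = params$t_f
  )
  return(sol)
}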

library(parallel)
nb_cores <- detectCores()
if (nb_cores > 124) {
  nb_cores = 124
}
cl <- makeCluster(nb_cores)
clusterEvalQ(cl,{
  library(GillespieSSA2)
})
clusterExport(cl,
              c("params",
                "run_one_sim"),
              envir = .GlobalEnv)
SIMS = parLapply(cl = cl, 
                 X = 1:params$number_sims, 
                 fun =  function(x) run_one_sim(params))
stopCluster(cl)

See SIS_CTMC_parallel.R on course GitHub repo

Benefit of parallelisation

Run the parallel code for 50 sims between tictoc::tic() and tictoc::toc(), giving 4.785 sec elapsed, then the sequential version

tictoc::tic()
SIMS = lapply(X = 1:params$number_sims, 
              FUN =  function(x) run_one_sim(params))
tictoc::toc()

which gives 158.306 sec elapsed on a 64C/128T AMD Ryzen Threadripper 3990X. That is, the parallel version ran 33.08 times faster!