Model Definitions in McMasterPandemic

Prerequisites

This document assumes knowledge of compartmental epidemic models. We will not be defining terms and symbols that we consider to be widely understood by mathematical epidemiologists.

Document Goals

This document describes a specification for a data structure to represent compartmental models. The goal of this document is not to define a spec for a user interface for constructing compartmental models. Instead, we define the data structure here so that interfaces for creating instances of compartmental models can be designed without needing to grapple with issues of how to represent and store model definitions.

User Stories

This spec design has been guided by several high-level user stories, which we outline briefly.

As a modeller, I want to use software that provides an extensive set of modelling capabilities, so that I can address specific and changing public health needs.

As a modeller, I want to build complex models out of simpler – easier to understand – modular sub-models, so that I can more easily modify the structure of my model as public health needs change.

As a developer, I want to be able to quickly add a new modelling capability, so that I can support public health modellers in as timely a manner as possible.

These and other user-stories were developed by working with public-health modellers during the COVID-19 pandemic in Canada.

Labelled Partitions

Before describing a data structure for representing Compartmental Models, we need to define the concept of labelled partitions. This concept allows for a consistent, flexible, compact, and readable mechanism for identifying model variables and sets of variables. For example, in a spatially structured model we might want to refer to all states in a particular location (e.g. Toronto) and a specific state within that location (e.g. susceptible individuals in Toronto).

There are two equivalent representations of a labelled partition, dotted and undotted. This equivalence allows us to go back-and-forth between the two representations without loosing information, but perhaps gaining convenience.

The undotted representation can be thought of as a data frame. The rows of the data frame represent the things being represented. Our primary example of a ‘thing’ is a variable in a compartmental model (e.g. number of susceptible individuals in Toronto, or the force of infection). The columns of the data frame represent different ways to partition the rows.

#>   Epi   Vax
#> 1   S unvax
#> 2   E unvax
#> 3   I unvax
#> 4   R unvax
#> 5   S   vax
#> 6   E   vax
#> 7   I   vax
#> 8   R   vax

Compartmental Models

A compartmental model is a directory with the following four files.

  1. variables – A CSV file describing the variables in the model.
  2. derivations – JSON file with an array of instructions on how to update the variables at each iteration of a model simulation.
  3. flows – A CSV file describing what variables are state variables and the magnitudes of flows to and from compartments.
  4. settings – A JSON file with an object of elements that determine how files 1-3 are used by the software.

The following sections describe each of these files in more detail.

Variables and Partitions

The variables.csv file has row for each variable in the model and one column for each labelled partition of the variables. A labelled partition is a character vector of labels that describes the variables. Each row in this file must be unique so that it uniquely identifies each variable.

For example, here is a list of variables for the standard SEIR model.

#>     Epi
#> 1     S
#> 2     E
#> 3     I
#> 4     R
#> 5  beta
#> 6 alpha
#> 7 gamma
#> 8     N
#> 9   foi

This example has a single labelled partition, Epi. This partition uniquely identifies each variable and so we do not need other partitions to comply with the uniqueness requirement. Still, it is often convenient to add other partitions to refer to multiple variables at the same time. For example it might help to distinguish between state variables and parameters and also between infected states and non-infected states.

#>      Type       Infect   Epi
#> 1   state not_infected     S
#> 2   state     infected     E
#> 3   state     infected     I
#> 4   state not_infected     R
#> 5   param               beta
#> 6   param              alpha
#> 7   param              gamma
#> 8 derived                  N
#> 9 derived                foi

We emphasize that the list with the single Epi partition did not violate this spec, as it is not required to add a partition to make a distinction between state variables and parameters. The spec distinguishes Variable Roles like this in the settings file. In this example, the addition of the Type and Infect partitions are for convenience. The point is that the user is free to add any number of partitions even if they are not required for uniqueness.

We use the term names for strings that identify partitions and the term labels for strings that identify variables with respect to a particular partition.

Note that we do not provide a column for the numeric quantity of each variable because this is a spec for a compartmental model data structure. By working with pure model structure objects, we can combine different atomic models to more easily produce models with more structure. When it comes time to use a model, the computational engine can be used to associate model variables with numerical quantities.

Non-empty labels in each partition must contain only letters, numbers, underscores, and must start with a letter. Empty labels are zero-length strings that can be used to indicate that some partitions are not applicable to some variables. The purpose for these naming and labelling rules is to facilitate the construction of Variable Names that can be inverted to reproduce the variables list.

By convention we use UpperCamelCase for partition names and a modified form of snake_case for variable labels. Our modification of snake case allows for single uppercase letters in order to accommodate the convention in epidemiology for using single uppercase letters to refer to state variables. For example, S, I, and R, as well as I_mild and I_severe, would be consistent with our modified snake case style.

Required Partitions

There is a component, settings$required_partitions, which is a character vector of the names of the partitions that are used to uniquely identify each variable. In the previous example with Type, Infect, and Epi partitions we have settings$required_partitions == "Epi". We refer to the labels associated with required partitions as required labels. Every variable must have at least one non-empty required label. The required partitions are used to generate Variable Names.

Atomic and Non-Atomic Models

We refer to models with exactly one required partition as atomic. One may combine two atomic models to get another model with two required labelled partitions. An example combination of atomic models is the SEIR model stratified by the states of an atomic vaccination model with required partition, Vax.

#>         Vax
#> 1     unvax
#> 2       vax
#> 3 dose_rate

In this atomic model unvax and vax are state variables representing the numbers of individuals who are unvaccinated and vaccinated. The dose_rate variable is a parameter giving the proportion of unvaccinated individuals who receive a vaccination each day.

We combine these two atomic models to produce a non-atomic model with required partitions, Epi and Vax.

#>      Epi       Vax
#> 1      S     unvax
#> 2      E     unvax
#> 3      I     unvax
#> 4      R     unvax
#> 5   beta     unvax
#> 6  alpha          
#> 7  gamma          
#> 8      N     unvax
#> 9    foi     unvax
#> 10     S       vax
#> 11     E       vax
#> 12     I       vax
#> 13     R       vax
#> 14  beta       vax
#> 15 alpha          
#> 16 gamma          
#> 17     N       vax
#> 18   foi       vax
#> 19       dose_rate

Note here that the state variables, S, E, I, and R, as well as the transmission rate, beta, are stratified by vaccination status, but the latency and recovery parameters, alpha and gamma, are not stratified indicating that latency and recovery is not influenced by vaccination status. Similarly the dose_rate is also not stratified by epidemiological state.

Consider one more example with different symptomatic statuses, where infectious individuals have either mild or severe symptoms and these different statuses recover at different rates.

#>               Epi   Symp       Vax
#> 1               S            unvax
#> 2               E            unvax
#> 3               I   mild     unvax
#> 4               I severe     unvax
#> 5               R            unvax
#> 6            beta            unvax
#> 7               S              vax
#> 8               E              vax
#> 9               I   mild       vax
#> 10              I severe       vax
#> 11              R              vax
#> 12           beta              vax
#> 13          alpha                 
#> 14          gamma   mild          
#> 15          gamma severe          
#> 16 infectiousness   mild          
#> 17 infectiousness severe          
#> 18                       dose_rate

Variable Names

Although each row of the variables list provides a unique identifier for each variable, it is convenient to be able to combine the required labels into a single string. We call this string the variable name.

The name of a variable is the dot-delimited concatenation of the partitions in settings$required_partitions associated with the variable. For example, the following table contains the labelled partitions of the Epi-by-Symp-by-Vax variable list with a fourth column giving the variable names.

#>      Epi   Symp       Vax variable_names
#> 1      S            unvax       S..unvax
#> 2      E            unvax       E..unvax
#> 3      I   mild     unvax   I.mild.unvax
#> 4      I severe     unvax I.severe.unvax
#> 5      R            unvax       R..unvax
#> 6   beta            unvax    beta..unvax
#> 7      S              vax         S..vax
#> 8      E              vax         E..vax
#> 9      I   mild       vax     I.mild.vax
#> 10     I severe       vax   I.severe.vax
#> 11     R              vax         R..vax
#> 12  beta              vax      beta..vax
#> 13 alpha                         alpha..
#> 14 gamma   mild              gamma.mild.
#> 15 gamma severe            gamma.severe.
#> 16              dose_rate    ..dose_rate

This choice of a dot delimiter leads to syntactically valid R names even if the name begins with one or more dot. Because dots are not allowed in the labels themselves, this delimiter also makes parsing the labels into their respective partitions unambiguous. Note that it is impossible to have names with all dots because all variables must have at least one non-blank label.

The biggest downside to the choice of a dot delimiter is that it conflicts with the tidyverse style, which requires using dots only to specify S3 method dispatch. The reason for this guideline is to reduce the chances of ambiguity when the S3 machinery searches for methods. This disadvantage is in our opinion outweighed by the advantages that we list above, because model variables will not be involved in S3 dispatch.

Variable names are invertible, in that a vector of variable names can be combined with a vector of partition names to reproduce the associated labelled partitions. This fact allows us to choose either the labelled partitions or names representation depending on convenience or necessity.

Variable Roles

Variable Names are used in lists that identify variables with specific roles in the model. There are two such lists in the settings.json file.

All state variables and flow variables must be scalars.

Partition Sets

The required_paritions component is a set of partitions that are used to uniquely identify each variable and to generate invertible variable names. The user is free to define other partition_sets so that they can refer to groups of variables by name. These sets can be used to filter the variables for a specific purpose. Such partition_sets are useful when combining models.

Variable names can be used with respect to a particular partition set. Variable names with respect to some partition sets do not uniquely identify each variable. For example, Epi.Symp == "I.mild" occurs twice in the following example.

#>      Epi   Symp       Vax     Epi.Symp    Epi.Vax
#> 1      S            unvax           S.    S.unvax
#> 2      E            unvax           E.    E.unvax
#> 3      I   mild     unvax       I.mild    I.unvax
#> 4      I severe     unvax     I.severe    I.unvax
#> 5      R            unvax           R.    R.unvax
#> 6   beta            unvax        beta. beta.unvax
#> 7      S              vax           S.      S.vax
#> 8      E              vax           E.      E.vax
#> 9      I   mild       vax       I.mild      I.vax
#> 10     I severe       vax     I.severe      I.vax
#> 11     R              vax           R.      R.vax
#> 12  beta              vax        beta.   beta.vax
#> 13 alpha                        alpha.     alpha.
#> 14 gamma   mild             gamma.mild     gamma.
#> 15 gamma severe           gamma.severe     gamma.
#> 16              dose_rate            . .dose_rate

The ability to refer to multiple variables with a single name is useful in many circumstances. Typically such non-unique names are unique within groups of variables. For example, Epi.Symp == "I.mild" is unique within Vax == "vax" and within Vax == "unvax".

As the examples above have illustrated, the name of a partition set is the dot-delimited concatenation of the names of the partitions in the set. For example, Epi.Symp is the name of the partition set containing the Epi and Symp partitions.

Derived Variables

Now that we have a solid mechanism for representing, referring to, and grouping model variables, we can begin to consider defining Flow between Compartments. When the rate of flow from one compartment to another is given by a model parameter (e.g. recovery rate, gamma), then it is straightforward to describe the flow between those compartments (e.g. I to R). However, when the flow rate is a function of parameters and state variables (e.g. force of infection), we first need to define how to compute such a rate.

In this section we describe a data structure for representing mathematical expressions for deriving model variables that must be recomputed at each iteration of a simulation. Flow rates such as the force of infection is an example of such a derived variable. There are many other examples, including sums of groups of (e.g. vaccinated) compartments, model summaries such as \(\mathcal{R}_0\), and convolutions of infected compartments for modelling under-reporting and reporting delays.

Each derived variable must be represented by a row in the variables list. Derived variables are distinguished from input variables by being associated with an element in a list, derivations. This list contains objects that describe how to compute each derived variable. Each object contains the following components.

TODO: Perhaps this should be a ‘compressed’ data format for derivations, and the simple derived_var_name = function(...) ... should be used as the base case? The compressed format can be used to form the basis of an interface design?

The spec for this file will be hard to understand on first reading. To understand it better we provide several examples designed to justify and clarify the description in the above bullet list.

We begin with defining the force of infection for an SEIR model with the following variable list.

#>     Epi
#> 1     S
#> 2     E
#> 3     I
#> 4     R
#> 5  beta
#> 6 alpha
#> 7 gamma
#> 8     N
#> 9   foi

The force of infection for this model can be represented as follows.

list(
  output_partition = "Epi",
  output_names = "foi",
  input_partition = "Epi",
  math_function = function(S, E, I, R, beta) {
    I * beta / sum(S, E, I, R)
  }
)

Because this is an atomic model, by specifying output_partition = "Epi" and output_names "foi", we are really just saying that the name of the output variable is "foi". More technically we would say that the derived variable will have the label "foi" in the "Epi" partition. The input_partition and math_function go together. The math_function describes how to compute the derived variable, foi, in terms of other variables. The input_partition gives the name of a partition set to use for matching model variables to the arguments of this function. For example, the argument list contains S, which is a label in the Epi partition set.

This description may seem unnecessarily complex, but it is useful for representing more complex models as we will see. Also remember that this is a spec for a data structure and not a user interface – it is likely that a user interface for specifying this component would make it much simpler to do so.

Updating derived variable specifications when taking model products is reasonably straightforward with this data structure. For example, consider stratifying S, E, I, R, and beta by geographic location.

#>      Epi                 Location
#> 1      S                 montreal
#> 2      E                 montreal
#> 3      I                 montreal
#> 4      R                 montreal
#> 5   beta                 montreal
#> 6      S                  toronto
#> 7      E                  toronto
#> 8      I                  toronto
#> 9      R                  toronto
#> 10  beta                  toronto
#> 11 alpha                         
#> 12 gamma                         
#> 13       montreal_to_toronto_rate
#> 14       toronto_to_montreal_rate

In this new model, the functional form of the force of infection remains constant and we simply update and add some of the partition sets and variable names.

list(
  group_partition = "Location",
  group_names = c("montreal", "toronto"),
  output_partition = "Epi",
  output_names = c("foi", "foi"),
  input_partition = "Epi",
  math_function = function(S, E, I, R, beta) {
    I * beta / sum(S, E, I, R)
  }
)

This will define expressions to produce derived variables called foi.montreal and foi.toronto. To get this list we did the following.

  1. Added geographical grouping information to specify how to stratify the force of infection.
  2. Repeat the entry in output_names once for each location.

This model product assumes that individuals in Toronto cannot be infected by individuals in Montreal. This is a reasonable assumption for this spatial stratification, but in models where the stratification does not isolate compartments from each other we need to add an additional derived variable that sums the force of infection components over the strata. Consider for example the SEIR model stratified by vaccination status.

#>      Epi       Vax
#> 1      S     unvax
#> 2      E     unvax
#> 3      I     unvax
#> 4      R     unvax
#> 5   beta     unvax
#> 6  alpha          
#> 7  gamma          
#> 8      N     unvax
#> 9    foi     unvax
#> 10     S       vax
#> 11     E       vax
#> 12     I       vax
#> 13     R       vax
#> 14  beta       vax
#> 15 alpha          
#> 16 gamma          
#> 17     N       vax
#> 18   foi       vax
#> 19       dose_rate

We can compute the force of infection components just as we did for the spatial model.

list(
  group_partition = "Vax",
  group_names = c("unvax", "vax"),
  output_partition = "Epi",
  output_names = c("foi", "foi"),
  input_partition = "Epi",
  math_function = function(S, E, I, R, beta) {
    I * beta / sum(S, E, I, R)
  }
)

But to complete the model product we must add a derived variable that takes the sum of all of the Epi == "foi" variables.

list(
  output_partition = "epi.vax",
  output_names = "foi.total",
  input_partition = "epi.vax",
  input_names_dots = c("foi.unvax", "foi.vax"),
  math_function = function(...) sum(...)
)

This will produce a single derived variable called foi.total, which can be used as the force of infection for all vaccination strata.

There are other operations on model space besides model products. Another such operation is to convert a single compartment into several compartments. An example of this operation is to break up the infectious compartment into several symptomatic states.

#>               Epi EpiEffective   Symp      Type
#> 1               S            S                 
#> 2               E            E                 
#> 3               I                mild component
#> 4               I              severe component
#> 5               I                        vector
#> 6               I            I                 
#> 7               R            R                 
#> 8            beta         beta                 
#> 9           alpha        alpha                 
#> 10          gamma        gamma   mild          
#> 11          gamma        gamma severe          
#> 12 infectiousness                mild component
#> 13 infectiousness              severe component
#> 14 infectiousness                        vector
#> 15            foi          foi

Notice the "Type" partition that indicates the different forms with which the symptomatic-status-structured variables are represented. Here the scalar components of I and infectiousness can be referred to, or one can refer to vectors that contain all components. The components are I.mild.component, I.severe.component, infectiousness.mild.component, and infectiousness.severe.component. The vectors are I..vector and infectiousness..vector. The variable I..effective refers to the the weighted sum given by the inner product of I..vector and infectiousness..vector.

Here we assume that the component types are input variables, and that we derive the other I and infectiousness variables with the following definitions.

list(
  group_partition = "Epi.Symp",
  group_names = c("I.mild", "I.severe", "infectiousness.mild", "infectiousness.severe"),
  output_partition = "Epi.Type",
  output_names = c("I.vector", "infectiousness.vector"),
  input_partition = "Symp",
  input_names_dots = c("mild", "severe"),
  math_function = function(...) c(...)
)
# list(
#   #group_partition = "Epi",
#   #group_names = c("I", "I", "infectiousness.mild", "infectiousness.severe"),
#   output_partition = "Epi.EpiEffective",
#   output_names = "I.I",
#   input_partition = "Epi.Symp",
#   input_names_dots = c("I.mild", "I.severe", "infectiousness.mild", "infectiousness.severe"),
#   math_function = function(...) {
#     l = unlist(list(...))
#     n = length(l)
#     sum(l[1:(n/2)] * l[((n/2)+1):n])
#   }
# )
list(
  group_partition = "Type",
  group_names = "vector",
  output_partition = "EpiEffective",
  output_names = "I",
  input_partition = "Epi",
  math_function = function(I, infectiousness) {
    sum(I * infectiousness)
  }
)
list(
  output_partition = "Epi",
  output_names = "foi",
  input_partition = "Epi",
  math_function = function(S, E, I, R)
)

The ordering of the derivations list must be topologically sorted so that the dependencies of each expression appear earlier in the list. This topological sorting requirement presents a technical challenge when models are combined, as it becomes necessary to be able to merge lists of derivations so that they remain topologically sorted.

Flow between Compartments

We have another data frame, flows, with rows describing how individuals flow between pairs of compartments. There are four columns in this data frame, from, to, rate_component, and component_type.

For example, consider this SIR model.

#>     epi
#> 1     S
#> 2     I
#> 3     R
#> 4  beta
#> 5 gamma
#> 6   foi

We would have the following flows.csv file.

#>   from to flow_component component_type
#> 1    S  I            foi           rate
#> 2    I  R          gamma           rate

Notice that all state variables appear in this table, and that it contains no parameters. This is the mechanism for encoding whether a model variable is a state variables or a parameter.

A more interesting example is the following Epi-by-Vax model.

#>      Epi       Vax
#> 1      S     unvax
#> 2      I     unvax
#> 3      R     unvax
#> 4   beta     unvax
#> 5      S       vax
#> 6      I       vax
#> 7      R       vax
#> 8   beta       vax
#> 9  alpha          
#> 10 gamma          
#> 11   foi          
#> 12       dose_rate

The flows data frame for these variables could look like the following.

#>      from      to flow_component component_type
#> 1 S.unvax I.unvax      foi.total           rate
#> 2 I.unvax R.unvax         gamma.           rate
#> 3   S.vax   I.vax      foi.total           rate
#> 4   I.vax   R.vax         gamma.           rate
#> 5 S.unvax   S.vax     .dose_rate           rate
#> 6 I.unvax   I.vax     .dose_rate           rate
#> 7 R.unvax   R.vax     .dose_rate           rate

Flows in Structured Models

Structured models can have a large number of flows making the above flows data structure very long. For example, an SEIR model with three symptomatic status’, five vaccination status’, in 30 geographic locations would have over 750 flows depending on the details of the vaccination and geographic models. The length of the list is not in-and-of-itself the main problem with this data structure. A bigger problem is that the data structure does not preserve the model structure, making it difficult to visualize the underlying simplicity. For example, if the model structure was preserved in the flows data it would be possible to automatically construct box diagrams for communicating the simplicity of a seemingly complex compartmental model.

On the other hand, the benefit of the more explicit data structure in [Flows between Compartments] is that it is easy to read each line. For example, the following line clearly tells me that unvaccinated exposed individuals flow to unvaccinated infectious individuals with mild symptoms at a rate that is independent of both vaccination status and symptomatic status.

#>       from           to    flow flow_type
#> 1 E..unvax I.mild.unvax alpha..      rate

In this section we add more columns to the data structure for encoding model structure. These additional columns make it possible to shorten the list and retain model structure information, at the cost of making the meaning of each line more difficult to decode. However, if required, as we will see, the structured version can always be converted to the simpler but longer unstructured version. But it is not possible to go from unstructured to structured.

The first three additional columns contain partition sets to resolve the names in from, to, and flow. Consider the following example.

#>   from to flow flow_type from_partition to_partition flow_partition
#> 1    S  E  foi      rate            Epi          Epi            Epi

Consider this example within the context of an SEIR model stratified by two vaccination status’. In this case this data structure does not allow us to distinguish between models where S.unvax flows only to E.unvax from one where it is also allowed to flow to E.vax. To address this problem, we need three more partition columns for matching up valid S-E pairs. The following table provides an example of this.

#>   from to flow flow_type from_partition to_partition flow_partition
#> 1    S  E  foi      rate            Epi          Epi            Epi
#>   from_to_partition from_flow_partition to_flow_partition
#> 1               Vax

File Formats

Both CSV files have the following format.

Additionally, the only characters that are allowed in this file are uppercase and lowercase letters, digits, underscores, and commas. The presence of any other character is sufficient to invalidate the file.

The settings.json file contains a single object with the following key-value pairs.

The derivations.json file contains a single array of objects. Each of these objects has the following required key-value pairs.

Each of these objects has the following optional key-value pairs, although some of these optional pairs become required given the existence of others.