 Research
 Open Access
 Published:
Assessing parameter identifiability in compartmental dynamic models using a computational approach: application to infectious disease transmission models
Theoretical Biology and Medical Modellingvolume 16, Article number: 1 (2019)
Abstract
Background
Mathematical modeling is now frequently used in outbreak investigations to understand underlying mechanisms of infectious disease dynamics, assess patterns in epidemiological data, and forecast the trajectory of epidemics. However, the successful application of mathematical models to guide public health interventions lies in the ability to reliably estimate model parameters and their corresponding uncertainty. Here, we present and illustrate a simple computational method for assessing parameter identifiability in compartmental epidemic models.
Methods
We describe a parametric bootstrap approach to generate simulated data from dynamical systems to quantify parameter uncertainty and identifiability. We calculate confidence intervals and mean squared error of estimated parameter distributions to assess parameter identifiability. To demonstrate this approach, we begin with a lowcomplexity SEIR model and work through examples of increasingly more complex compartmental models that correspond with applications to pandemic influenza, Ebola, and Zika.
Results
Overall, parameter identifiability issues are more likely to arise with more complex models (based on number of equations/states and parameters). As the number of parameters being jointly estimated increases, the uncertainty surrounding estimated parameters tends to increase, on average, as well. We found that, in most cases, R_{0} is often robust to parameter identifiability issues affecting individual parameters in the model. Despite large confidence intervals and higher mean squared error of other individual model parameters, R_{0} can still be estimated with precision and accuracy.
Conclusions
Because public health policies can be influenced by results of mathematical modeling studies, it is important to conduct parameter identifiability analyses prior to fitting the models to available data and to report parameter estimates with quantified uncertainty. The method described is helpful in these regards and enhances the essential toolkit for conducting modelbased inferences using compartmental dynamic models.
Background
Mathematical modeling is commonly applied in outbreak investigations for analyzing mechanisms behind infectious disease transmission and explaining patterns in epidemiological data [1, 2]. Models also provide a quantitative framework for assessing intervention and control strategies and generating epidemic forecasts in real time. However, the successful application of mathematical modeling to investigate epidemics depends upon our ability to reliably estimate key transmission and severity parameters, which are critical for guiding public health interventions. In particular, parameter estimates for a given system are subject to two major sources of uncertainty: noise in the data and assumptions built in the model [3]. Ignoring this uncertainty can result in misleading inferences and potentially incorrect public health policy decisions.
Appropriate and flexible approaches for estimating parameters from data, evaluating parameter and model uncertainty, and assessing goodness of fit are gaining increasing attention [4,5,6,7,8]. For instance, model parameters can be estimated by connecting models with observed data through various methods, including leastsquares fitting [9], maximum likelihood estimation [10, 11], and approximate Bayesian computation [12, 13]. An important, yet often overlooked step in estimating parameters is examining parameter identifiability – whether a set of parameters can be uniquely estimated from a given model and data set [14]. Lack of identifiability, or nonidentifiability, occurs when multiple sets of parameter values yield a very similar model fit to the data. Nonidentifiability may be attributed to the model structure (structural identifiability) or due to the lack of information in a given data set (practical identifiability), which could be associated with the number of observations, spatialtemporal resolution (e.g., daily versus weekly data), and observation error. A parameter set is considered structurally identifiable if any set of parameter values can be uniquely mapped to a model output [15]. As such, structural identifiability is the first step in understanding which model parameters can be estimated from data of certain state(s) of the system at a specific spatialtemporal resolution. Structurally identifiable parameters may still be nonidentifiable in practice due to a lack of information in available data. The socalled “practical identifiability” considers realworld data issues: amount of noise in the data and sampling frequency (e.g., data collection process) [14].
Several methods have been proposed to examine structural identifiability of a model without the need of experimental data; these include Taylor series methods [15, 16], differential algebrabased methods [17, 18], and other mathematical approaches [15, 19]. These methods tend to work better in the context of simple rather than complex models. Model complexity, in general, is a function of the number of parameters necessary to characterize the states of the system and the spectrum of dynamics that can be recovered from the model. Model complexity affects the ability to reliably parameterize the model given the available data [3], so there is a need for flexible, mathematicallysound approaches to address parameter identifiability in models of varying complexity. Here, we present a general computational method for quantifying parameter uncertainty and assessing parameter identifiability through a parametric bootstrap approach. We demonstrate this approach through examples of compartmental epidemic models with variable complexity, which have been previously employed to study the transmission dynamics and control of various infectious diseases including pandemic influenza, Ebola, and Zika.
Methods
Compartmental models
Compartmental models are widely used in epidemiological literature as a populationlevel modeling approach that subdivides the population into classes according to their epidemiological status [1, 20]. Compartmental dynamic models are specified by a set of ordinary differential equations and parameters that track the temporal progression of the number of individuals in each of the states of the system [3, 21]. Dynamic models follow the general form:
⁞
Where \( {\dot{x}}_i \) is the rate of change of the system states (where i = 1, 2, …, h) and Θ = (θ_{1}, θ_{2}, …, θ_{m}) is the set of model parameters.
The basic reproductive number (denoted R_{0}) is often a parameter of interest in epidemiological studies, as it is a measure of potential for a given infectious disease to spread within a population. Mathematically, it is defined as the average number of secondary infections produced by a single index case in a completely susceptible population [22]. R_{0} represents an epidemic threshold for which values of R_{0} < 1 indicate a lack of disease spread, and values of R_{0} > 1 are consistent with epidemic spread. In the midst of an epidemic, R_{0} estimates provide insight to the intensity of interventions required to achieve control [23]. R_{0} is a composite parameter value, as it depends on multiple model parameters (e.g., transmission rate, infectious period), and while R_{0} is not directly estimated from the model, it can be calculated by relying on the uncertainty of individual parameters.
A simple and commonly utilized compartmental model is the SEIR (susceptibleexposedinfectiousremoved) model [1]. We apply our methodology to this lowcomplexity model and work through increasingly more complex models as we demonstrate the approach for assessing parameter identifiability.
Model 1: Simple SEIR (pandemic influenza)
We analyze a simple compartmental transmission model that consists of 4 parameters and 4 states (Fig. 1). We apply this model to the context of the 1918 influenza pandemic in San Francisco, California [23]. Individuals in the model are classified as susceptible (S), exposed (E), infectious (I), or recovered (R) [1]. We assume constant population size, so S + E + I + R = N, where N is the total population size. Susceptible individuals progress to the exposed class at rate βI(t)/N, where β is the transmission rate, and I(t)/N is the probability of random contact with an infectious individual. Exposed, or latent, individuals move to the infectious class at rate k, where 1/k is the average latent period. Infectious individuals recover (move to recovered class) at rate γ, where 1/γ corresponds to the average infectious period.
The transmission process can be modeled using the following system of ordinary differential equations (where the dot denotes time derivative):
The auxiliary variable C(t) tracks the cumulative number of infectious individuals from the start of the outbreak. It is not a state of the system of equations, but simply a class to track the cumulative incidence cases; meaning, individuals from the population are not moving to class C. The number of new infections, or the incidence curve, is given by \( \dot{C}(t) \).
For this model, there is only one class contributing to new infections (I), so R_{0}, or the basic reproductive number, is simply the product of the transmission rate and the average infectious period: R_{0} = \( \frac{\beta }{\gamma } \) .
Model 2: SEIR with asymptomatic and hospitalized/diagnosed and reported
We use a simplified version of a complex SEIR model that consists of 8 parameters and 6 system states (Fig. 2). This model was originally developed for studying the transmission dynamics of the 1918 influenza pandemic in Geneva, Switzerland [24]. In the model, individuals are classified as susceptible (S), exposed (E), clinically ill and infectious (I), asymptomatic and partially infectious (A), hospitalized/diagnosed and reported (J), or recovered (R). Hospitalized individuals are assumed to be as infectious as individuals in the I class. Again, constant population size is assumed, so S + E + I + A + J + R = N. Susceptible individuals progress to the exposed class at rate β[I(t) + J(t) + qA(t)]/N, where β is the transmission rate, and q is a reduction factor of transmissibility in the asymptomatic class (0 < q < 1). A proportion, ρ, of exposed/latent individuals (0 < ρ < 1) become clinically infectious at rate k, while the rest (1 ρ) become partially infectious and asymptomatic at the same rate k. Asymptomatic cases progress to the recovered class at rate γ_{1}. Clinically ill and infectious individuals are diagnosed at a rate α or recover without being diagnosed at rate γ_{1}. Diagnosed individuals recover at rate γ_{2}.
The transmission process can be modeled using the following system of ordinary differential equations:
In the above system, C(t) represents the cumulative number of diagnosed/reported cases from the start of the outbreak, and \( \dot{C}(t) \) is the incidence curve of diagnosed cases.
For this model, there are three classes contributing to new infections (A, I, J), so the reproductive number is the sum of the contributions from each of these classes: R_{0} = R_{0}^{A} + R_{0}^{I} + R_{0}^{J}, where:
R_{0}^{A} = (fraction of asymptomatic cases) x (transmission rate) x (relative transmissibility from asymptomatic cases) x (mean time in asymptomatic class)
R_{0}^{I} = (fraction of symptomatic cases) x (transmission rate) x (mean time in clinically infectious class)
R_{0}^{J} = (fraction of symptomatic cases that are hospitalized) x (transmission rate) x (mean time in hospital) [24]
Here, \( {R}_0=\beta \Big[\left(1\rho \right)\left(\frac{q}{\gamma_1}\right)+\rho \left(\frac{1}{\gamma_1+\alpha }+\frac{\alpha }{\left({\gamma}_1+\alpha \right){\gamma}_2}\right) \)].
Model 3: The Legrand et al. model (Ebola)
We analyze an Ebola transmission model [25] comprised of 15 parameters and 6 states (Fig. 3). This model subdivides the infectious population into three stages to account for transmission in three settings: community, hospital, and unsafe burial ceremonies. Individuals are classified as susceptible (S), exposed (E), infectious in the community (I), infectious in the hospital (H), infectious after death at funeral (F), or recovered/removed (R). Constant population size is assumed, so S + E + I + H + F + R = N. Susceptible individuals progress to the exposed class at rate (β_{I}I(t) + β_{H}H(t) + β_{F}F(t))/N where β_{I}, β_{H}, and β_{F} represent the transmission rates in the community, hospital, and at funerals, respectively. Exposed individuals become infectious at rate α. A proportion, 0 < θ < 1, of infectious individuals are hospitalized at rate γ_{h}. Of the proportion of infectious individuals that are not hospitalized (1θ), a proportion, 0 < δ_{1} < 1, move to the funeral class at rate γ_{d}, and the rest (1 δ_{1}) move to the recovered/removed class at rate γ_{i}. A proportion, 0 < δ_{2} < 1, of hospitalized individuals progress to funeral class at rate \( {\upgamma}_{dh}=\frac{1}{\frac{1}{\upgamma_d}\frac{1}{\upgamma_h}} \). The remaining proportion (1 δ_{2}) are recovered/removed at rate \( {\upgamma}_{ih}=\frac{1}{\frac{1}{\upgamma_i}\frac{1}{\upgamma_h}} \). δ_{1} and δ_{2} are calculated such that δ represents the case fatality ratio (Table 3). Individuals in the funeral class are removed at rate γ_{f}.
The transmission process is modeled by the following set of ordinary differential equations:
Here, C(t) represents the cumulative number of all infectious individuals, and \( \dot{C}(t) \) is the incidence curve for infectious cases.
The basic reproductive number is the sum of the contributions from each of the infectious classes (I, H, F): R_{0} = R_{0}^{I} + R_{0}^{H} + R_{0}^{F}, where:
R_{0}^{I} = (transmission rate in the community) x (mean time in infectious class)
R_{0}^{H} = (fraction of hospitalized cases) x (transmission rate in the hospital) x (mean time in hospital class)
R_{0}^{F} = (fraction of cases that have traditional burial ceremonies) x (transmission rate at funerals) x (mean time in funeral class)
Here, \( {R}_0=\frac{\beta_I}{\Delta }+\frac{\frac{\gamma_h\theta }{\gamma_{dh}{\delta}_2+{\gamma}_{ih}\left(1{\delta}_2\right)}{\beta}_H}{\Delta }+\frac{\gamma_d{\delta}_1\left(1\theta \right){\beta}_F}{\gamma_f\Delta }+\frac{\gamma_{dh}{\gamma}_h{\delta}_2\theta {\beta}_F}{\gamma_f\left({\gamma}_{ih}\left(1{\delta}_2\right)+{\gamma}_{dh}{\delta}_2\right)\Delta }, \)
where ∆ = γ_{h}θ + γ_{d}(1 − θ)δ_{1} + γ_{i}(1 − θ)(1 − δ_{1}) [25].
Model 4: Zika model with human and mosquito populations
The last example is a compartmental model of Zika transmission dynamics that includes 16 parameters and 9 states and incorporates transmission between two populations – humans and vectors (Fig. 4). This model was designed to investigate the impact of both mosquitoborne and sexually transmitted (humantohuman) routes of infection for cases of Zika virus [26]. In the human population, individuals are classified as susceptible (S_{h}), asymptomatically infected (A_{h}), exposed (E_{h}), symptomatically infectious (I_{h1}), convalescent (I_{h2}), or recovered (R_{h}). The mosquito, or vector, population is broken into susceptible (S_{v}), exposed (E_{v}), and infectious (I_{v}) classes. Note that the subscript ‘h’ is used for humans and ‘v’ is used for vectors. Constant population size is assumed in both populations, so S_{h} + A_{h} + E_{h} + I_{h1} + I_{h2} + R_{h} = N_{h} and S_{v} + E_{v} + I_{v} = N_{v}.
A proportion 0 < θ < 1 of susceptible humans move to the exposed class at rate ab(I_{v}(t)/N_{h}) + β[(αE_{h}(t) + I_{h1}(t) + τI_{h2}(t))/N_{h})] where a is the mosquito biting rate, b is the transmission probability from an infectious mosquito to a susceptible human, β is the transmission rate between humans, α is the relative (humantohuman) transmissibility from exposed humans to susceptible, and τ is the relative transmissibility from convalescent humans compared to susceptible. Exposed individuals progress to symptomatically infectious at rate κ_{h} and then progress to the convalescent stage at rate γ_{h1}. Convalescent individuals recover at rate γ_{h2}. The remaining proportion of susceptible individuals (1  θ) become asymptomatically infected at the same rate, ab(I_{v}(t)/N_{h}) + β[(αE_{h}(t) + I_{h1}(t) + τI_{h2}(t))/N_{h}]. Asymptomatic humans recover at rate γ_{h} and do not contribute to new infections in this model.
Susceptible mosquitos move to the exposed class at rate ac[(ρE_{h}(t) + I_{h1}(t))/N_{h}], where c is the transmission probability from a symptomatically infectious human to a susceptible mosquito, and ρ is the relative humantomosquito transmission probability from exposed humans to symptomatically infected. Exposed mosquitos become infectious at rate κ_{v}. Mosquitos also leave the population at rate μ_{v}, where 1/μ_{v} is the mosquito lifespan.
The transmission process, including both populations, is represented by the set of differential equations below:
C(t) represents the cumulative number of symptomatically infectious human cases, and \( \dot{C}(t) \) contains the incidence curve for symptomatic human cases.
For this example, we have two transmission processes to consider when calculating R_{0}: sexual transmission (R_{hh}) and mosquitoborne (R_{hv}). The human population has three classes contributing to new infections: exposed, symptomatically infectious, and convalescent, so:
The mosquito population only has one infectious class (I_{v}); the reproductive number is given by:
The overall basic reproductive number, considering both transmission routes, is given by the following eq. [26]:
Simulated data
For each model we simulate 200 epidemic datasets (directly from the corresponding set of ordinary differential equations) with Poisson error structure using the daily time series data of case incidence, or total number of new cases daily. Parameters for each model are set at values based on their corresponding application: the 1918 influenza pandemic in San Francisco (Model 1) [23], 1918 pandemic influenza in Geneva (Model 2) [24], 1995 Ebola in Congo (Model 3) [25], and 2016 Zika in the Americas (Model 4) [26]. As explained below, the simulated data are generated using a bootstrap approach, and we then use these data to study parameter identifiability within a realistic parameter space for each model. Parameter descriptions and their corresponding values for each model are given in Tables 1, 2, 3 and 4.
Parameter estimation
To estimate parameter values, we fit the model to each simulated dataset using nonlinear least squares estimation. The lsqcurvefit function in Matlab (Mathworks, Inc.) is used to find the least squares best fit to the data. This process searches for the set of parameters \( \widehat{\varTheta} \)= (\( \widehat{\theta} \)_{1}, \( \widehat{\theta} \)_{2},…, \( \widehat{\theta} \)_{m}) that minimizes the sum of squared differences between the simulated data and the model solution [3]. The model solution \( f\left({t}_i,\widehat{\varTheta}\right) \) represents the best fit to the time series data.
For this method, the initial parameter predictions affect the solution for the model as local minima occur. While we know the true parameter values (used to generate the data), this is unrealistic for a realworld modeling scenario. We vary the initial guesses of the parameter values to vary according to a uniform distribution in the range of +/− 0.1 around the true value. Another approach would consist of repeating the least squares fitting procedure several times with different initial parameter guesses and selecting the best model fit.
For each model, the sets of parameters are denoted by Θ_{i}, where i represents the number of parameters being jointly estimated. We begin with estimating one model parameter, while fixing the rest, and then increase the number of parameters jointly estimated by one until all parameters of interest are included. Population size, N, is always fixed to the true value. Also, while R_{0} is not being directly estimated from the model, it is a composite parameter that can be calculated using individual parameter estimates.
For each model described above, we explore parameter identifiability for the following sets of parameters. Here, the symbol ^ is used to indicate an estimated parameter, while the absence of this symbol indicates that the parameter is set to its true value from the simulated data.
(i) Model 1: Simple SEIR
(ii) Model 2: SEIR with asymptomatic and hospitalized/diagnosed and reported
(iii) Model 3: The Legrand Model (Ebola)
(iv) Model 4: Zika model with human and mosquito populations
Bootstrapping method
We use the parametric bootstrap approach [3, 27, 28] for simulating the error structure around the deterministic model solution in order to evaluate parameter identifiability. This computational approach involves repeatedly sampling observations from the bestfit model solution. Here we use a Poisson error structure, which is the most popular distribution for modeling count data [3]. The stepbystep approach to quantify parameter uncertainty is as follows:

1.
Obtain the deterministic model solution (total daily incidence series) using nonlinear leastsquares estimation (Section 2.3).

2.
Generate S replicate datasets, assuming Poisson error structure:
Using the deterministic model solution \( f\left({t}_i,\widehat{\varTheta}\right) \), generate S (for our examples, S = 200) replicate simulated datasets \( {f}_S^{\ast}\left({t}_i,\widehat{\varTheta}\right) \). To incorporate Poisson error structure, we use the incidence curve, \( \dot{C}(t) \), as follows. For each time point t, we generate a new incidence value using a Poisson random variable with mean=\( \dot{C}(t) \). This new set of data represents an incidence curve for the system, assuming the time series follows a Poisson distribution centered on the mean at time points t_{i}.

3.
Reestimate model parameters: For each simulated dataset, derive the bestfit estimates for the parameter set using leastsquares fitting (Section 2.3). This results in S estimated parameter sets: \( \widehat{\varTheta} \)_{i} where i = 1, 2, …, S.

4.
Characterize empirical distributions and construct confidence intervals: Using the set of S parameter estimates, we can characterize the empirical distribution and construct confidence intervals for each estimated parameter. Also, for each set of estimated parameters, R_{0} is calculated to obtain a distribution of R_{0} values as well.
Parameter identifiability
When a model parameter is identifiable from available data, its confidence interval lies in a finite range of values [29, 30]. Using the bootstrapping method outlined in Section 2.4, we obtain 95% confidence intervals from the distributions of each estimated parameter. A small confidence interval with a finite range of values indicates that the parameter can be precisely identified, while a wider range could be indicative of lack of identifiability. To assess the level of bias of the estimates, we calculate the mean squared error (MSE) for each parameter. MSE is calculated as: \( MSE=\frac{1}{S}\sum \limits_{i=1}^S{\left(\uptheta \widehat{\uptheta_i}\right)}^2 \) where θ represents the true parameter value (in the simulated data), and \( \widehat{\theta_i} \) represents the estimated value of the parameter for the i^{th} bootstrap realization.
When a parameter can be estimated with low MSE and narrow confidence, this suggests that the parameter is identifiable from the model. On the other hand, larger confidence intervals or larger MSE values may be suggestive of nonidentifiability.
Results
Model 1: Simple SEIR
Additional files 1, 2 and 3: illustrate the empirical distributions of the estimated parameters, where Additional file 1: represents the results for \( \widehat{\varTheta} \)_{1}(β only), Additional file 2: for \( \widehat{\varTheta} \)_{2} (β and γ), and Additional file 3: for \( \widehat{\varTheta} \)_{3} (β, γ, and κ). The figures also show the original simulated data and the 200 simulated datasets for each estimated parameter set.
Estimating only β (Θ_{1}), results in precise (small confidence interval range) and unbiased (small MSE) estimates of β. Similarly, estimating β and γ (Θ_{2}) provides precise and unbiased estimates for both parameters. The precision of the estimates can be seen in Fig. 5: the confidence intervals for the estimates (represented by red vertical lines) remain close to the true parameter value (blue horizontal dotted line). The MSE plot (Fig. 6) shows an MSE value of < 10^{− 7} for β in Θ_{1} and values of < 10^{− 4} for both β and γ in Θ_{2}.
Simultaneously estimating all 3 parameters, β, κ, and γ (Θ_{3}), results in wider confidence intervals and larger MSE than the two previous subsets. The confidence intervals for β (0.516, 0.636) and γ (0.223, 0.277) have a narrow range and enclose the true values of the parameters. The MSE for these two are larger compared to the previous subsets, though all MSE values are < 10^{− 2}. The confidence interval for κ has a slightly larger range (0.440, 0.613), though this correlates with a small latent period difference of less than a day. Also, the MSE for κ is comparable to the other parameters. This indicates that all three parameters can be identified from daily incidence data of the epidemic curve with Poisson error structure.
Moreover, R_{0} can be estimated precisely with unbiased results. Despite the larger confidence intervals for the other parameters estimated in Θ_{3} (compared to Θ_{1}, Θ_{2}), the range around R_{0} is still very precise: (2.286, 2.317). Similarly, MSE for R_{0} is < 10^{− 4} for all runs. This indicates that the estimates of R_{0} are robust to variation or bias in the other parameter estimates – we will continue to explore this theme in the proceeding models.
Model 2: SEIR with asymptomatic and hospitalized/diagnosed and reported
Estimating β only (Θ_{1}) or β and γ_{1} (Θ_{2}) provides precise estimates with small MSE (Figs. 7 & 8). For each Θ_{i} (where i > 2), each additional parameter being estimated corresponds with, on average, a larger confidence interval range and higher MSE for each estimated parameter. Essentially, for each parameter, the uncertainty grows with the number of other parameters being jointly estimated. Θ_{3}, estimating β, γ_{1}, and α, provides estimates of β and γ_{1} with relatively small confidence ranges (95% CI: (0.717, 0.851), (0.192, 0.286), respectively) and MSE values (MSE = 0.0016, 7.15*10^{− 4}, respectively); however, estimates for α produce a wider range of values (0.386, 0.748), as well as an MSE value over 5 times higher than the other parameters (MSE = 0.0089), though still < 10^{− 2}.
Results for Θ_{4} and Θ_{5} indicate that none of the parameters can be wellidentified from case incidence data while simultaneously estimating > 3 parameters. For each, multiple parameters have MSE values > 10^{− 2} (Fig. 8), and the confidence intervals are comparatively wide. Additionally, the confidence intervals for ρ (Θ_{4}: (0.602, 0.858); Θ_{5}: (0.608, 0.763)) do not include the true value of 0.60.
Looking at confidence intervals and MSE (Figs. 7 & 8) for R_{0}, we find again that R_{0} is identifiable across each Θ_{i}. The confidence intervals for R_{0} all have a range < 0.2, and the MSE values for each Θ_{i} are < 10^{− 2}. These R_{0}results are consistent with those in Model 1, despite the identifiability issues of other parameters seen here in Model 2. This is an important result, indicating that even when identifiability issues exist in other model parameters, we can still provide reliable estimates of R_{0} without having to know the true values of the other parameters. It also shows that while noise in the data may affect parameter estimation for some parameters, composite parameters, like R_{0}, can still be accurately calculated from the same data.
Model 3: The Legrand model (Ebola)
Estimated parameter sets Θ_{1} and Θ_{2} (β_{I} only, β_{I} and β_{H} respectively) result in unbiased (MSE < 10^{− 3}), precise estimates of the parameters (Figs. 9 & 10). However, when jointly estimating all three β values (Θ_{3}), only β_{I} is identifiable – the confidence interval is a finite range: (0.038, 0.102) and the estimates are unbiased (MSE = 2.71*10^{− 4}). Parameters β_{H} (0, 0.614) and β_{F} (0.097, 1.341) both have wide confidence intervals indicating uncertainty suggestive of nonidentifiability. Estimating four parameters (Θ_{4}), only β_{H} is identifiable with a small range and bias; whereas, the remaining three parameter estimates have larger confidence intervals (Fig. 9).
For Θ_{i} where i > 4, none of the parameters can be identified from the model/data. Each parameter (for runs Θ_{5} – Θ_{7}) has either a large confidence range and/or comparatively large MSE. Some parameters have MSE values < 10^{− 2} (Fig. 10), but the wide range of uncertainty around these parameters is still indicative of nonidentifiability (Fig. 9).
Remarkably, R_{0} can be precisely estimated with unbiased results for parameter sets Θ_{1} – Θ_{4} (Figs. 9 & 10). When simultaneously estimating five or more parameters, however, the associated uncertainty of all the parameters results in nonidentifiability of R_{0}. For Θ_{5}, for example, R_{0} estimates vary widely in the range (0.683, 2.821) with an MSE of 0.467. As previously mentioned, R_{0} is a threshold parameter (epidemic threshold at R_{0} = 1), so given the confidence interval including the critical value 1, we would not have the ability to distinguish between the potential for epidemic spread versus no outbreak.
Model 4: Zika model with human and mosquito populations
For this complex model, we find again that when estimating only 1 or 2 parameters (Θ_{1}, Θ_{2}), the parameters can be recovered precisely with unbiased results (Figs. 11 & 12). When jointly estimating more than two parameters (Θ_{i}: i > 2), nonidentifiability issues arise. It can be seen that the confidence intervals and MSE for β and γ_{h1} are very small, and thus they are identifiable. However, all of the confidence intervals and MSE values for each of the other parameters (Θ_{i}: i > 2) are representative of nonidentifiability. The parameter estimates have a large amount of uncertainty, represented by the large confidence intervals, and are also biased estimates of the true value: MSE > 10^{− 2} for all.
In terms of R_{0}, we can see that this composite parameter of interest is identifiable for all Θ_{i} (Figs. 11 & 12). Despite the large confidence intervals associated with some parameters (ex: Θ_{6} – γ_{h2}: (0.047, 0.573)), when estimating more than two parameters, R_{0} can still be estimated with low uncertainty: (Θ_{6} – R_{0}: (1.480, 1.486)). The R_{0} estimates have little error, as MSE < 10^{− 4} for all Θ_{i}. This is consistent with the previous models in that R_{0} estimates are robust to the uncertainty and bias of the other estimated parameters.
Discussion
In this paper we have introduced a simple computational approach for assessing parameter identifiability in compartmental models comprised of systems of ordinary differential equations. We have demonstrated this approach through various examples of compartmental models of infectious disease transmission and control. Using simulated time series of the number of new infectious individuals, we analyzed the identifiability of model characterizing transmission and the natural history of the disease. This type of analysis based on simulated data provides a crucial step in infectious disease modeling, as inferences based on estimates of nonidentifiable parameters can lead to incorrect or ineffective public health decisions. Parameter identifiability and uncertainty analyses are essential for assessing the stability of the parameter estimates. Hence, it is important for researchers to be mindful that a good fit to the data does not imply that parameter estimates can be reliably used to evaluate hypotheses regarding transmission mechanisms. Moreover, quantifying the uncertainty surrounding parameter estimates is key when making inferences that guide public health policies or interventions.
Our bootstrapbased approach is sufficiently general to assess identifiability for compartmental modeling applications. We have shown that this method works well for models of varying levels of complexity, ranging from a simple SEIR model with only a few parameters (Model 1) to a complex, dualpopulation compartmental model with a total of 16 parameters (Model 4). Other methods exist to conduct parameter identifiability analyses. Some methods, such as Taylor series methods [15, 16] and differential algebrabased methods [17, 18], require more mathematical analyses, which becomes increasingly complicated as model complexity increases. Other methods rely on constructing the profile likelihood for each of the estimated parameters to assess local structural identifiability [11, 14, 31, 32]. In this method, one of the parameters (θ_{i}) is fixed across a range of realistic values, and the other parameters are refit to the data using the likelihood function of θ_{i}. Thus, identifiability of the parameters is determined by the shape of the resulting likelihood profile. Depending on the assumptions of the error structure in the data and as models become increasingly more complex, derivation of the likelihood profile and confidence intervals becomes increasingly more difficult.
Overall, our analyses indicate that parameter identifiability issues are more likely to arise with more complex models (based on number of equations/states and parameters). For example, a set of 3 parameters (Θ_{3}) can be estimated with low uncertainty and bias from a simple model, like Model 1; however, for more complex models (Model 3, Model 4), estimating only 3 parameters from a single curve of case incidence resulted in lack of identifiability for at least one of the parameters in the set (Θ_{3}). Also, for Θ_{i} (recall: i represents number of parameters being jointly estimated), as i increases, the uncertainty surrounding estimated parameters tended to increase, on average, as well (Fig. 7). One strategy to resolve parameter identifiability issues consists of restricting the number of parameters being jointly estimated while fixing other parameter values and conducting sensitivity analyses.
Importantly, we found that R_{0} is a robust composite parameter, even in the presence of identifiability issues affecting individual parameters in the model. In Model 4, despite large confidence intervals and larger MSE for the estimated parameters, R_{0} estimates were contained in a finite confidence interval with little bias (Figs. 11 & 12). For example, for parameter set Θ_{6}, only two of the estimated parameters could be reliably identified from the data, yet R_{0} could be identified with little uncertainty or bias. These findings are in line with the identifiability results of R_{0} for a vectorborne disease model (similar to Model 4), even when other model parameters could not be properly estimated [14]. R_{0} is often a parameter of interest, as R_{0} values have been related to the size or impact of an epidemic [1]. Moreover, R_{0} estimates can be used to characterize initial transmission potential, assess the risk of an outbreak, and evaluate the impact of potential interventions, so it is beneficial to know we can reliably obtain R_{0} estimates, despite lack of identifiability in other parameters.
It is important to emphasize that our methodology is helpful to uncover identifiability issues which could arise from 1) the lack of information in the data or 2) the structure of the model. We also note that our examples assess identifiability of parameters by relying on the entire curve of incidence data of a single epidemic. Future work could include identifiability analyses in the context of limited data using different sections of the trajectory of the outbreak. We also assume that only one model variable (state) is observed, so future analyses could incorporate more than one observed variable to potentially improve the identifiability of parameters without changing the model. For example, for Model 3 (Ebola), the incidence curves of new hospitalized cases and new deaths could provide additional information that better constrain parameter estimates, thereby improving parameter identifiability results.
Conclusions
For modeling studies, we recommend conducting comprehensive parameter identifiability analyses based on simulated data prior to attempting to fit the model to data. It is important to emphasize that lack of identifiability could be due to lack of information in the data or the structure of the model. The analyses also help guide the set of parameters in the model that can be jointly estimated – identifiability issues may not arise until any given number of parameters are being simultaneously estimated. If the analysis indicates nonidentifiability of certain parameters, may have to be assessed in sensitivity analyses (rather than estimated) to address the identifiability issue.
In summary, the ability to make sound public health decisions regarding an infectious disease outbreak is crucial for the general health and safety of a population. Knowledge of whether a parameter is identifiable from a given model and data is invaluable, as estimates of nonidentifiable parameters should not be used to inform public health decisions. Further, parameter estimates should be presented with quantified uncertainty. The methodology presented in this paper adds to the essential toolkit for conducting modelbased inferences.
References
 1.
Anderson RM, May RM. Infectious diseases of humans: dynamics and control. Oxford: Oxford University Press; 1991.
 2.
Diekmann O, Heesterbeek JA, Metz JA. On the definition and the computation of the basic reproduction ratio R_{0} in models for infectious diseases in heterogeneous populations. J Math Biol. 1990;28(4):365–82.
 3.
Chowell G. Fitting dynamic models to epidemic outbreaks with quantified uncertainty: a primer for parameter uncertainty, identifiability, and forecasts. Infectious Disease Modelling. 2017;2:379–98.
 4.
He D, King A, King AA, Ionides EL. Plugandplay inference for disease dynamics: measles in large and small populations as a case study. J R Soc Interface. 2010;7(43):271–83.
 5.
Goeyvaerts N, Willem L, Van Kerckhove K, Vandendijck Y, Hanquet G, Beutels P, et al. Estimating dynamic transmission model parameters for seasonal influenza by fitting to age and seasonspecific influenzalike illness incidence. Epidemics. 2015;13:1–9.
 6.
Chowell G, Viboud C, Simonsen L, Merler S, Vespignani A. Perspectives on model forecasts of the 2014–2015 Ebola epidemic in West Africa: lessons and the way forward. BMC Med. 2017;15(1):42.
 7.
Banks HT, Holm K, Robbins D. Standard error computations for uncertainty quantification in inverse problems: asymptotic theory vs. bootstrapping. Math Comput Model. 2010;52:1610–25.
 8.
Gibson GJ, Streftaris G, Thong D. Comparison and assessment of epidemic models. Stat Sci. 2018;33(1):19–33.
 9.
Banks H, Davidian M, Samuels J Jr, Sutton K. An inverse problem statistical methodology summary. In: Chowell G, Hyman J, Bettencourt L, CastilloChavez C, editors. Mathematical and statistical estimation approaches in epidemiology. Dordecht, The Netherlands: Springer; 2009. p. 249–302.
 10.
Wu KM, Riley S. Estimation of the basic reproductive number and mean serial interval of a novel pathogen in a small, wellobserved discrete population. PLoS One. 2016;11(2):1–12.
 11.
Breto C. Modeling and inference for infectious disease dynamics: a likelihoodbased approach. Stat Sci. 2018;33(1):57–69.
 12.
Scranton K, Knape J, de Valpine P. An approximate Bayesian computation approach to parameter estimation in a stochastic stagestructured population model. Ecology. 2014;5:1418.
 13.
Abdessalem AB, Dervilis N, Wagg D, Worden K. Model selection and parameter estimation in structural dynamics using approximate Bayesian computation. Mech Syst Signal Process. 2018;99:306–25.
 14.
Kao YH, Eisenberg M. Practical unidentifiability of a simple vectorborne model: implications for parameter estimation and intervention assessment. Epidemics. 2018;25:89–100.
 15.
Miao H, Xia X, Perelson AS, Wu H. On identifiability of nonlinear ODE models and applications in viral dynamics. SIAM Rev. 2011;1:3.
 16.
Pohjanpalo H. System identifiability based on powerseries expansion of solution. Math Biosci. 1978;41:21–33.
 17.
Eisenberg MC, Robertson SL, Tien JH. Identifiability and estimation of multiple transmission pathways in cholera and waterborne disease. J Theor Biol. 2013;324:84–102.
 18.
Ljung L, Glad T. Testing global identifiability for arbitrary model parameterizations. IFAC Proceedings Volumes. 1991;24:1085–90.
 19.
Chis OT, Banga JR, BalsaCanto E. Structural identifiability of systems biology models: a critical comparison of methods. PLoS One. 2011;6(11):1–16.
 20.
Lloyd A. Introduction to epidemiological modeling: basic models and their properties; 2007.
 21.
Brauer F, van der Driessche P, Wu J, Allen LJS. Mathematical epidemiology. Berlin: Springer; 2008.
 22.
van den Driessche P, Watmough J. Reproduction numbers and subthreshold endemic equilibria for compartmental models of disease transmission. Math Biosci. 2002;180:29–48.
 23.
Chowell G, Nishiura H. Comparative estimation of the reproduction number for pandemic influenza from daily case notification data. J R Soc Interface. 2007;4(12):155–66.
 24.
Chowell G, Ammon CE, Hengartner NW, Hyman JM. Estimation of the reproductive number of the Spanish flu epidemic in Geneva, Switzerland. Vaccine. 2006;24:6747–50.
 25.
Legrand J, Grais RF, Boelle PY, Valleron AJ, Flahault A. Understanding the dynamics of Ebola epidemics. Epidemiol Infect. 2007;4:610.
 26.
Gao D, Lou Y, He D, Porco TC, Kuang Y, Chowell G, et al. Prevention and control of Zika as a mosquitoborne and sexually transmitted disease: A mathematical modeling analysis. Scientific Reports. 2016;6:28070.
 27.
Efron B, Tibshirani R. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
 28.
Chowell G, Hengartner NW, CastilloChavez C, Fenimore PW, Hyman JM. The basic reproductive number of Ebola and the effects of public health measures: the cases of Congo and Uganda; 2005.
 29.
Cobelli C, RomaninJacur G. Controllability, observability and structural identifiability of multi input and multi output biological compartmental systems. IEEE Trans Biomed Eng. 1976;BME23(2):93.
 30.
Jacquez JA. Compartmental analysis in biology and medicine. 2nd ed. Ann Arbor: University of Michigan Press; 1985.
 31.
Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmuller U, et al. Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics. 2009;25(15):1923–9.
 32.
Nguyen VK, Binder SC, Boianelli A, MeyerHermann M, HernandezVargas EA. Ebola virus infection modeling and identifiability problems. Front Microbiol. 2015;6:257.
Acknowledgements
We thank Dr. Ping Yan (Public Health Agency of Canada) for interesting discussions relating to parameter identifiability.
Funding
GC acknowledges financial support from the NSF grant 1414374 as part of the joint NSFNIHUSDA Ecology and Evolution of Infectious Diseases program; UK Biotechnology and Biological Sciences Research Council grant BB/M008894/1.
Availability of data and materials
The datasets generated and/or analyzed in this study can be reproduced using the methods and Tables 14 or are available from the corresponding author on reasonable request. Matlab code is also available upon request.
Author information
Affiliations
Contributions
KR and GC designed the study. KR analyzed the data and KR and GC interpreted the data. GC and KR contributed to further draft and edit the manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Kimberlyn Roosa.
Ethics declarations
Ethics approval and consent to participate
All of the data employed in this study were generated through simulations. Data are deemed exempt from institutional review board assessment.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Model 1 – Θ_{1} (estimating β only): The histograms display the empirical distributions of the parameter estimates using 200 bootstrap realizations, where the solid red horizontal line represents the 95% confidence interval for parameter estimates, and the dashed red vertical line indicates the true parameter value. Note, κ and γ are set to their true values in the data. The bottom left graph shows the data from the model (blue circles), and 200 realizations of the epidemic curve assuming a Poisson error structure (light blue lines). The solid red line corresponds to the bestfit of the model to the data, and the dashed red lines correspond to the 95% confidence bands around the best fit. (TIF 5423 kb)
Additional file 2:
Model 1 – Θ_{2} (estimating β and γ): The histograms display the empirical distributions of the parameter estimates using 200 bootstrap realizations, where the solid red horizontal line represents the 95% confidence interval for parameter estimates, and the dashed red vertical line indicates the true parameter value. Note, κ is set to the true value from the data. The bottom left graph shows the data from the model (blue circles), and 200 realizations of the epidemic curve assuming a Poisson error structure (light blue lines). The solid red line corresponds to the bestfit of the model to the data, and the dashed red lines correspond to the 95% confidence bands around the best fit. (TIF 5423 kb)
Additional file 3:
Model 1 – Θ_{3} (estimating β, κ, and γ): The histograms display the empirical distributions of the parameter estimates using 200 bootstrap realizations, where the solid red horizontal line represents the 95% confidence interval for parameter estimates, and the dashed red vertical line indicates the true parameter value. The bottom left graph shows the data from the model (blue circles), and 200 realizations of the epidemic curve assuming a Poisson error structure (light blue lines). The solid red line corresponds to the bestfit of the model to the data, and the dashed red lines correspond to the 95% confidence bands around the best fit. (TIF 5423 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Compartmental models
 Parameter identifiability
 Uncertainty quantification
 Epidemic models
 Structural parameter identifiability
 Practical parameter identifiability