Estimating transmission probability in schools for the 2009 H1N1 influenza pandemic in Italy

Background Epidemic models are being extensively used to understand the main pathways of spread of infectious diseases, and thus to assess control methods. Schools are well known to represent hot spots for epidemic spread; hence, understanding typical patterns of infection transmission within schools is crucial for designing adequate control strategies. The attention that was given to the 2009 A/H1N1pdm09 flu pandemic has made it possible to collect detailed data on the occurrence of influenza-like illness (ILI) symptoms in two primary schools of Trento, Italy. Results The data collected in the two schools were used to calibrate a discrete-time SIR model, which was designed to estimate the probabilities of influenza transmission within the classes, grades and schools using Markov Chain Monte Carlo (MCMC) methods. We found that the virus was mainly transmitted within class, with lower levels of transmission between students in the same grade and even lower, though not significantly so, among different grades within the schools. We estimated median values of R 0 from the epidemic curves in the two schools of 1.16 and 1.40; on the other hand, we estimated the average number of students infected by the first school case to be 0.85 and 1.09 in the two schools. Conclusions The discrepancy between the values of R 0 estimated from the epidemic curve or from the within-school transmission probabilities suggests that household and community transmission played an important role in sustaining the school epidemics. The high probability of infection between students in the same class confirms that targeting within-class transmission is key to controlling the spread of influenza in school settings and, as a consequence, in the general population. Electronic supplementary material The online version of this article (doi:10.1186/s12976-016-0045-2) contains supplementary material, which is available to authorized users.

A Text of the questionnaire The questionnaire was delivered to parents (or care-givers) of the schools' students in the first days of December 2009, and returned to school teachers. Its text was the following: Has your child, student in the school XY, been sick of influenza-like syndrome (*) since October 15? (*) It is defined as influenza-like syndrome: fever ≥ 38 • C, together with at least one of the following symptoms: • head-ache • malaise • shivering • extreme fatigue and at least one of the following respiratory symptoms: • cough • sore throat • nasal congestion.

YES/NO.
If the answer is yes, please report the date of the first symptoms.
For each member of the same household, please report the age, whether it has been sick of influenza-like syndrome and, if so, the date of the first symptoms.

B The algorithm used
The available data are 1. whether students have shown influenza symptoms or not; 2. the day of symptom onset.
All this information will be named Z. For the moment, we neglect the problem of non-respondents.
We wish to obtain a posteriori distributions for the parameters q s , q A , q B (q A and q B are functions of q g and q c , see below) and γ (collectively named ϑ). As computing the likelihood of the data Z would be very complex, we denote Y the dates of infection of the individuals with symptoms onset, and we choose where π(ϑ) is the chosen a priori (uniform) distribution for the real parameters ϑ.
Then, through Bayes' formula where using the quantities (total number of infectious in grade j) (total number of infectious in the school) Non-respomdents are handled by including in the vector of added parameters Y the state (eventually infected or not) and, if so, the dates of infection and removal for all students whose information is missing.
The posterior distribution is estimated as the stationary distribution of the Markov chain obtained implementing the Metropolis-Hastings algorithm [2,4].

C Parameters updating
To update the parameters we use a Single Component Metropolis-Hastings (see [2]). Denote y the current parameter (either q c or q g or q s or ε). We update the parameter through the formula x = ye ra 1−y+ye ra where x denotes the proposed updated parameter, a is from a normal and r has been adjusted to obtain good mixing.
So, by simple calculations, we obtain that, and then q(y|x) q(x|y) is equal to As the days of infection are discrete variables, we updated them by a different method. Precisely, every cycle we randomly selected a class, and 3 students with symptoms within that class, to be updated; since our assumptions leave only two choices for the infection day, we simply changed the differences (infection day) − (symptoms day) from −1 to −2 or from −2 to −1 for the 3 infected students.
We proceeded similarly for the infection status of missing data, and for their symptoms day, allowing, in the latter case, for all possible days of observation.

D Test on simulated data D.1 Joint estimation of transmission probability and infectious period
We started with the simple case of a single class, aiming at the estimate of the probabilities to remain infectious for two days, γ, and to be infected from someone inside the same class, q c . We performed a series of simulations varying γ from 0.1 (the probability to remain infectious for two days is very low) to γ = 0.9 (the probability to remain infectious for two days is very high) with a step of 0.1; correspondingly, q c is changed in such a way that R 0 (in this case nq c (1 + γ)) remains constant.
In  Table S1: Parameter values for the probability of remaining infective for two days γ, and the class infection probability qc used in the 9 sets of simulations as to have R0 ≈ 1.48 with a class of n = 25 students and the fictional case of a class of n = 250 students.
children and for a (fictional) case of a class of 250 children. Figure S1 shows the results obtained in the case of a class of n = 25 children or in the (fictional) case of a class of 250 children. q c reference q c median (d) Figure S1: Estimates of γ (panels a and c) and of qc (panels b and d) in 9 sets of simulations performed by varying γ and qc while keeping R0 ≈ 1.48 (in Table S1). Panels a) and b) have n = 25, c) and d) have n = 250. The reference values used in the simulations are represented as white dots, while the medians of the posterior distribution are represented as black dots, with the bars representing 95%-credible intervals.
It may be noticed that the mean value of estimated γ is always close to 0.5, independently of the value of γ used in the simulation. Increasing n reduces the width of the credible intervals, but does not remove the bias. Also the mean estimated value of q c is almost constant, independently of the value used in the simulations, at the value that would be correct for γ = 0.5.
Because of this problem, in all the following simulations we set γ equal to 0.1 without estimating it.  Table S2; for each simulation we ran the MCMC algorithm to obtain posterior distributions of the parameters q c , q g , q s and ε. The results are shown in panel a) of Figure S2. q c > q s q c > q g q g > q s Figure S2: (a) Ten sets of 50 simulations starting from reference values represented as white dots (in Table   S2) and ε = 10 It can be seen that the estimates are reasonably correct, with the mean of the posterior distributions around the reference values. It may be noticed, however, that, when the three parameters q c , q g and q s are close to each other, the algorithm tends to overestimate q c , the transmission rate in the same class.

D.2 Simulations of school model
We also examined whether the 95%-credible intervals for the three parameters intersect each other.
The answers are shown in Figure S2 (panel b)).
It can be seen that, when the parameters are indeed equal, around 5% of the times one obtains nonintersecting 95%-credible intervals, something close to expectations.
On the other hand, a difference is picked up almost always between q c and q s from set 4 onwards (i. e. when the ratio q c /q s ≈ 3.8) and between q c and q g from set 8 onwards (i. e. when the ratio q c /q g ≈ 3.7), while the ratio q g /q s never reaches such values, and thus a difference between q g and q s is seen only occasionally in the simulations.
Finally we present in Figure S3 the estimates obtained for R 0 . In order to display the variability among simulations, we show for each simulation only the median estimate of R 0 : in the left panel, this is obtained by using formula (3) of the main text on the joint posterior distribution of (q c , q g , q s ) computed with the MCMC algorithm as described above; in the right panel it is obtained using R 0 = 1 + rT I where

D.3 Test with missing data
To test the behavior of the algorithm when dealing with missing data, as in the case of actual data, we re-analyzed the previous simulations, but assuming that a random 20% of the data of each class is not reported and thus cannot be used in the estimation procedure. The results obtained are shown in Figure S4 (panels a) and b)). The results can be considered almost as satisfactory as in the case without missing data. As expected, the credible intervals are somewhat wider than in the case without missing data, and thus they intersect somewhat more often. It can also be remarked that, with missing data, the transmission rate inside the class q c is on average overestimated in all simulation sets.

D.4 Test with missing data and errors
We further tested the algorithm by assuming that 20% of the simulated data in each class are not reported and that only 70% of the reported dates are correct, while 20% of them are shifted of ±1 day and the remaining 10% of ±2 days. Figure S4 (panels c) and d)) shows the estimates obtained in this scenario.
Also in this case the estimation of the data is reasonably correct and this indicated that we have obtained a robust result, although the absolute value of q c is somewhat overestimated when the parameters are close to each other. q c > q s q c > q g q g > q s Figure S4: Fifty simulations (panels a and c) starting from reference values represented by white dots (in Table   S2) and ε = 10 −3 by assuming that 20% of the data of each class are not reported. The median estimated values are represented as black dots with the 95%-credible intervals. Fraction of simulations in which the 95%-credible intervals for the different parameters do not intersect in panels b) and d).
In panels a) and b) 20% of the data are considered to be missing. Beyond that, in panels c) and d) 70% of the reported dates are correct, 20% are shifted of ±1 day and the remaining 10% of ±2 days.