Simulation-based assessment of model selection criteria during the application of benchmark dose method to quantal response data
Theoretical Biology and Medical Modelling volume 17, Article number: 13 (2020)
Abstract
Background
To employ the benchmark dose (BMD) method in toxicological risk assessment, it is critical to understand how the BMD lower bound for reference dose calculation is selected following statistical fitting procedures of multiple mathematical models. The purpose of this study was to compare the performances of various combinations of model exclusion and selection criteria for quantal response data.
Methods
Simulation-based evaluation of model exclusion and selection processes was conducted by comparing validity, reliability, and other model performance parameters. Three different empirical datasets for different chemical substances were analyzed for the assessment, each having different characteristics of the dose-response pattern (i.e. datasets with rich information in high or low response rates, or approximately linear dose-response patterns).
Results
The best performing criteria of model exclusion and selection were different across the different datasets. Model averaging over the three models with the lowest three AIC (Akaike information criteria) values (MA3) did not produce the worst performance, and MA3 without model exclusion produced the best results among the model averaging methods. Model exclusion, including the use of the Kolmogorov-Smirnov test in advance of model selection, did not necessarily improve the validity and reliability of the models.
Conclusions
If a uniform methodological suggestion for the guideline is required to choose the best performing model for exclusion and selection, our results indicate that using MA3 is the recommended option whenever applicable.
Background
To determine the reference dose of chemical substances, including food additives and agricultural chemicals, that cause the presence or absence of a harmful event (i.e. a dichotomous outcome), and thereby specify the acceptable daily intake, a number of scientific approaches using dose-response experimental data have been used. A popular toxicological method uses the responses to low-dose exposures to confirm the absence of an outcome event. The highest dose that does not cause an event is referred to as the no observable adverse effect level (NOAEL), below which no outcome is expected. In practice, the point of departure has been determined by multiplying this value by a specified uncertainty factor that addresses uncertainties, including the biological species barrier between experimental animals and humans [1]. However, the determination of NOAEL depends on low-dose data alone; if the number of experimental animals per dose is limited, this imposes a serious statistical limitation that involves non-negligible sampling errors [2, 3]. An alternative method, the benchmark dose (BMD) method, was initially formalized by Crump [4]. The BMD method determines the threshold dose by fitting various statistical models to the dose-response curve, which addresses the problems surrounding the use of NOAEL because it can account for the response data across different doses and can help in objectively calculating the point of departure. The BMD method can potentially be extremely useful in many scientific disciplines [5, 6]. The benchmark dose lower bound (BMDL), which is the lower (one-sided) limit of the 95% confidence interval of BMD, can yield a point of departure that is comparable to that based on NOAEL (Fig. 1) [7, 8].
To employ the BMD method, it is critical to select the best performing BMDL by following the statistical fitting procedures of multiple mathematical models. Parameterized models only characterize reality, so multiple models (usually nine or more) are commonly fitted to the same experimental dataset. As a result, many BMDL values can act as candidates for the preferred reference dose. However, the reference dose should be the best performing BMDL, and it must be selected, for example, as the one that gives the best fitting results [9]. There are two additional issues in selecting or determining the BMDL. First, the BMD method uses a specified percentile point (e.g. 10% of the benchmark response, abbreviated as BMD_{10}) as the threshold for the reference value, but the 10% percentile point is never strictly objective [10,11,12]. This is similar to using a p-value of 5% in many hypothesis tests or other arbitrarily chosen threshold values. Second, some fitted models (e.g. the Weibull model) yield different parameter estimates when restrictions on the range of parameters are imposed in advance of the inference procedure [9]. Quantitative guides for such restrictions can be complicated for non-expert users.
Although several technical problems exist, we believe that the biggest obstacle to the wide application of the BMD method in various governmental settings is the lack of uniform guidelines that specify the steps required to scrutinize fitting results and identify a single BMDL value for determining the acceptable daily intake. Objective guidelines are required to determine which candidate models should be included in, or excluded from, the final evaluation report. Model exclusion has been attempted by goodness-of-fit testing and by measuring arbitrarily defined markers of fit, e.g., the ratio of BMD to BMDL [13,14,15]. Nevertheless, model exclusion has not been consistently practiced (i.e. it is sometimes not conducted), and the criteria of exclusion have not been verified and/or harmonized across different studies. Further, more model selection methods have been discussed and developed [16, 17] than exclusion methods. A conservative approach is to use the modeling result that yields the lowest BMDL among all the fitted models [10, 18], which was recommended by the European Food Safety Authority in 2009 [19]. However, the model with the lowest BMDL might be the model with the broadest uncertainty; i.e. a wide confidence interval caused by a bad fit (e.g. even a fitted model can be rejected by Pearson's chi-squared test [10]). The AIC (Akaike information criteria) [16, 20] or BIC (Bayesian information criteria) [16] could be used as alternative ways of measuring goodness-of-fit and selecting the model that gives the lowest value (i.e. the best fit model). However, having the lowest AIC does not guarantee that the goodness-of-fit of the model around the low dose response will be successful and valid to yield an appropriate BMDL [10, 16]. Model averaging has been proposed as a possible solution [21,22,23] that may partly resolve the uncertainties associated with the use of mathematical models to explain the dose-response data.
An updated document in 2017 from the European Food Safety Authority [24] recommends the selection of multiple models with close AIC values (within ±2) and averaging the results from all the selected models. Model averaging has been proposed in various risk assessment settings using dose-response data [25,26,27], but a standard application method of model averaging has yet to be decided, including whether badly fitted models should be used for averaging (e.g. model averaging over all converged models or over well-fitted models only).
While all issues surrounding the use of the BMD method for quantal response data cannot be fully and immediately resolved, a simulation-based evaluation might help to identify a possible well-performing pathway of model exclusion and selection. To support the formulation of technical guidelines for risk assessment practices for food safety in Japan, we conducted a simulation study to compare the performance of each and various combinations of model exclusion and selection criteria, as applied to three qualitatively different types of quantal response datasets.
Methods
Quantal response data
For the simulation-based assessment, we selected three datasets that are qualitatively different; i.e., (i) a dataset with frequent testing at doses with high response rates, (ii) a dataset with frequent testing at doses with low response rates, and (iii) a dataset with doses involving both high and low response rates. Specifically, the data were retrieved from animal experiments with (i) 1-aminoanthraquinone with an outcome of eosinophilic droplets in the proximal tubular epithelium of the kidney in male rats [28], (ii) 2-ethylhexyl vinyl ether with an outcome of centrilobular hypertrophy in liver stem cells in male rats [29], and (iii) acrylamide with an outcome of axon degeneration in peripheral nerve in male rats [30] as datasets (i), (ii), and (iii), respectively. In this study, we were not concerned with the biological properties of the experimental results or interpretations for toxicological assessment; rather, we manually selected these datasets purely on the basis of the qualitative patterns of the observed dose-response curves. The sample size for each determined dose was n = 13, 6, and 48, and the original studies examined responses at 4, 4, and 5 different doses (thus involving a total of 52, 24, and 240 exposed animals in datasets (i), (ii), and (iii), respectively).
Using a total of nine different distributions that consist of 2–4 unknown parameters, the BMD method was employed to analyze the datasets. For each dataset, we first identified the best-fit model by selecting the model with the lowest AIC value, without imposing any parameter restrictions and without excluding any models in advance of model selection. The nine statistical models used in this study were:

Logistic model: \( \frac{1}{1+\exp \left(-a-bx\right)} \),

Probit model: Φ(a + bx),

Log-logistic model: \( g+\frac{1-g}{1+\exp \left(-b-c\log (x)\right)} \),

Log-probit model: g + (1 − g)Φ(b + c log(x)),

Gamma model: \( g+\left(1-g\right)\frac{1}{\varGamma (a)}{\int}_0^{bx}{t}^{a-1}\exp \left(-t\right)\, dt \),

Weibull model: g + (1 − g)(1 − exp(−ax^{b})),

Multistage (quadratic) model: g + (1 − g)(1 − exp(−ax − bx^{2})),

Multistage (cubic) model: g + (1 − g)(1 − exp(−ax − bx^{2} − cx^{3})),

Quantal-linear model: g + (1 − g)(1 − exp(−ax)),
where a, b, and c represent unknown parameters, g is also an unknown parameter but it is used to represent the baseline response value for 0 ≤ g < 1, x is the dose, Φ(x) is the cumulative distribution function of the standard normal distribution at dose x, and Γ(x) is the gamma function at dose x. During the simulations, we regarded the identified best model for each chemical substance as the “reference model”. Such a true model is accompanied by the known lower bound of the benchmark dose with response level at 10% (i.e. unbiased BMDL_{10}) as derived from the maximum likelihood estimates of the parameters.
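To make the parameterization above concrete, a few of the nine models can be written as plain functions. This is an illustrative sketch under the definitions given in the text, not the authors' code; the function names are our own.

```python
import numpy as np
from scipy.stats import norm

# Sketch of four of the nine quantal dose-response models from the text.
# a, b: shape parameters; g: background response (0 <= g < 1); x: dose.
def logistic(x, a, b):
    return 1.0 / (1.0 + np.exp(-a - b * x))

def log_probit(x, g, b, c):
    # Phi is the standard normal cumulative distribution function
    return g + (1.0 - g) * norm.cdf(b + c * np.log(x))

def weibull(x, g, a, b):
    return g + (1.0 - g) * (1.0 - np.exp(-a * x**b))

def quantal_linear(x, g, a):
    return g + (1.0 - g) * (1.0 - np.exp(-a * x))
```

Each function returns the expected response probability at dose x, rising from the background rate g toward 1 as the dose increases.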
The statistical estimation was conducted using the maximum likelihood method, and the likelihood function was defined under the assumption that the quantal response data at a given dose follow a binomial distribution. Computation of the 95% confidence interval (CI), including BMDL and the BMD upper bound (BMDU) (i.e. the one-sided upper 95% CI of BMD), relied on the bootstrapping method. Specifically, case resampling was performed using the Monte Carlo algorithm. We did not use the profile likelihood method, to avoid a too conservative (underestimated) CI. We also did not use parametric bootstrapping, because the sample sizes in the original datasets were small and the use of a multivariate normal distribution was not fully supported.
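The binomial maximum likelihood step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the doses, responder counts, and starting values are assumed, and the logistic model stands in for any of the nine.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Assumed example data: doses, animals per dose, responders per dose
doses = np.array([0.0, 10.0, 30.0, 100.0])
n     = np.array([13, 13, 13, 13])
y     = np.array([0, 2, 7, 12])

def logistic(x, a, b):
    return 1.0 / (1.0 + np.exp(-a - b * x))

def neg_log_lik(theta):
    # binomial log-likelihood; probabilities clipped to avoid log(0)
    p = np.clip(logistic(doses, *theta), 1e-9, 1 - 1e-9)
    ll = (gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
          + y * np.log(p) + (n - y) * np.log(1 - p))
    return -ll.sum()

fit = minimize(neg_log_lik, x0=np.array([-2.0, 0.05]), method="Nelder-Mead")
aic = 2 * len(fit.x) + 2 * fit.fun   # AIC = 2k - 2*logL
```

Repeating this fit over all nine models and comparing the resulting AIC values reproduces the model-ranking step; the bootstrap CI would rerun the same fit on case-resampled datasets.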
Simulationbased evaluation
We performed a simulation-based assessment of model performance using the three "reference models" with three different dose-response curves. Briefly, our analysis proceeded as follows: (i) identification of a reference model for each dataset by AIC (Akaike Information Criteria); (ii) generation of a total of 1000 simulated datasets from the reference model (each dataset was fitted by the nine individual models); (iii) application of model exclusion criteria, where applicable; (iv) application of one of the model selection criteria, including methods using model averaging, to select or calculate a single representative BMDL value from each dataset; and (v) evaluation of the resulting BMDL values in two aspects, validity and reliability.
Because we performed the statistical estimation ourselves, we knew the unbiased BMD_{10} and unbiased BMDL_{10} values that should be recovered by the BMD method using the simulated datasets. Specifically, we randomly generated a total of 1000 simulated datasets from the reference model (Fig. 2). The response outcome data were randomly generated from a binomial distribution for each examined exposure dose for the number of samples that were originally allocated to the given dose (i.e. n = 13, 6, and 48 dichotomous responses at each observation dose for the substances in datasets (i), (ii), and (iii)). For each replicated dataset, we fitted a total of nine standard distributions of the BMD method and examined whether an appropriate BMDL_{10} value could be recovered. To recover BMDL_{10} values, we imposed different combinations of model exclusion and selection criteria, which allowed us to assess which criteria would likely produce a valid and reliable estimate. The candidate selection methods included model averaging. To determine whether the simulated criteria were valid and reliable, we evaluated the performance as follows:
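The dataset-replication step above can be sketched as binomial sampling from the reference curve at the original doses and sample sizes. The doses and parameter values below are assumed placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumed design: four doses, 13 animals per dose (as in dataset (i))
doses = np.array([0.0, 10.0, 30.0, 100.0])
n_per_dose = 13

def reference_curve(x):
    # placeholder reference model (logistic with assumed parameters)
    return 1.0 / (1.0 + np.exp(2.0 - 0.05 * x))

# 1000 simulated datasets: responder counts per dose drawn from Binomial(n, p)
p = reference_curve(doses)
simulated = rng.binomial(n_per_dose, p, size=(1000, len(doses)))
```

Each row of `simulated` plays the role of one replicated experiment, to which all nine models would then be fitted.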
(i) Validity.
The simulated BMDL_{10} value must be lower than the unbiased BMD_{10} because the statistical role of BMDL is to act as the one-sided 95% lower bound of BMD. Out of the total of 1000 simulations for each chemical substance, the validity of criteria j was measured as the proportion

\( {V}_j=\frac{1}{1000}{\sum}_{i=1}^{1000}\mathbf{1}\left({l}_{ij}<B\right) \),

where l_{ij} is the BMDL_{10} value based on the ith simulated dataset and determined using model exclusion and selection criteria j, 1(·) is the indicator function, and B is the unbiased BMD_{10} value.
(ii) Reliability.
For the criteria to be reliable, similar results must be reproduced by repeating the same experiments. That is, the simulated BMDL_{10} value must be close to the unbiased BMDL_{10} value, and criteria that yield a "distant" BMDL_{10} value from the unbiased one would be regarded as a bad combination. Reliability was measured quantitatively as the mean relative distance from the unbiased BMDL_{10},

\( {R}_j=\frac{1}{1000}{\sum}_{i=1}^{1000}\frac{\left|{l}_{ij}-L\right|}{L} \),

where L is the unbiased BMDL_{10} value.
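Under the definitions above, both performance measures reduce to one-line computations over the simulated BMDL_{10} values. The values below are toy numbers for illustration only.

```python
import numpy as np

# Toy simulated BMDL10 values l_ij for one criterion j (assumed numbers)
bmdl_sim = np.array([0.8, 0.9, 1.1, 0.7, 0.85])
B = 1.0   # unbiased BMD10 (assumed)
L = 0.9   # unbiased BMDL10 (assumed)

# Validity: proportion of simulated BMDL10 values below the unbiased BMD10
validity = np.mean(bmdl_sim < B)

# Reliability: mean relative distance from the unbiased BMDL10
reliability = np.mean(np.abs(bmdl_sim - L) / L)
```

In the study itself these would be computed over 1000 simulations per criterion, with higher validity and lower mean distance indicating better performance.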
In addition to validity and reliability, we also assessed the calculability of the BMDL value, i.e., the proportion of simulated datasets, out of the total of 1000, that yielded convergence and thus a BMDL value. Moreover, to avoid the impact of substantial exclusions before model selection on the calculability assessment, we also calculated the same proportion out of the simulated datasets that survived the model exclusion process. Lastly, as a potential pitfall of simulation-based studies, it is important to remember that the original true model is likely to be recovered more often than other models, especially if a higher number of samples is tested for each dose. To avoid overoptimistic interpretation of the simulated results, we calculated the proportion of the 1000 simulated datasets for which the selected model, out of the nine candidates, was identical to the original (i.e. reference) model. If the selected statistical model was the same as the reference model, it is possible that the corresponding result may have been recovered due to the computational nature of the simulation study (e.g. random simulations using the Weibull model may lead to the choice of the Weibull model in each simulation run).
Model exclusion and selection criteria
We considered a total of four possible model exclusion criteria and six possible model selection criteria. Avoiding excessive combinations of the two (i.e. multiple exclusion criteria plus model averaging over preferred models only), we tested and compared a total of 18 possible combinations.
The four model exclusion criteria were (i) no exclusion; (ii) implementing goodness-of-fit testing using the Kolmogorov-Smirnov test (KS test) to avoid models with p < 0.10; (iii) the KS test to exclude models with p < 0.10, plus exclusion of models with BMD/BMDL ratios > 10; and (iv) the KS test to exclude models with p < 0.10, plus exclusion of models with BMDU/BMDL ratios > 10 [31]. We used the KS test rather than Pearson's chi-squared or Fisher's exact test because the experimental sample sizes were very small [31,32,33]. Models with BMD/BMDL or BMDU/BMDL ratios > 10 were excluded because models with ratios exceeding 10 have been regarded as insufficiently precise to yield a proper confidence limit [34,35,36]. Only models that survived these exclusion procedures were used in the model selection process.
Among the six model selection criteria, three were single selection criteria and three were model averaging methods. For the single selection criteria: (i) select the model with the lowest BMDL value to be conservative as part of risk assessment practice (Lowest BMDL) [10, 18]; (ii) select the model with the lowest BMD value, not necessarily relying on the lower uncertainty bound (Lowest BMD) [14]; or (iii) select the model with the lowest AIC value as the best fit model (Lowest AIC) [14]. We also computed model averaging results, not by taking the average BMDL value, but by averaging all or part of the fitted models for each resampled dataset. Model averaging takes into account the model uncertainty by integrating results from all or selected models [26, 37,38,39,40]. We considered three different patterns of model averaging: (i) model averaging over all nine models (MAall) [25]; (ii) model averaging over the three models with the lowest three AIC values (MA3) [25]; and (iii) model averaging over all models that yielded AIC values within 3 of the lowest AIC value (MAAIC). Letting π_{i}(d) be the dose-response curve of the ith model at a given dose d, MAall was calculated as \( {\pi}_{\mathrm{MAall}}(d)={\sum}_{i=1}^9{w}_i{\pi}_i(d) \), where \( {w}_k=\frac{\exp \left(-{I}_k/2\right)}{\sum_{i=1}^9\exp \left(-{I}_i/2\right)} \) and I_{k} is the AIC value of model k. MA3 was calculated using the same formula with normalization over the three best-fit models as judged by AIC. Similarly, MAAIC was computed using the arithmetic average of models (i.e. averaging without a weight function) and, adhering to rules of thumb [17], averaging all models with AIC within 3 of the lowest AIC of the best-fit model. The weight function was not used for MAAIC because, in this instance, models with similar AIC values are regarded as equally well fitted models.
MA3 and MAAIC are intended to conduct averaging over well-fitted models, compared with MAall, so we did not examine combinations of model exclusion with MA3 or MAAIC, to avoid removing badly fitted models multiple times.
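The AIC-based weights defined above can be sketched directly: w_k is the softmax of −AIC/2, renormalized over whichever model set is being averaged. The AIC values below are illustrative, not from the study.

```python
import numpy as np

# Illustrative AIC values for five fitted models (assumed numbers)
aic = np.array([100.2, 101.0, 103.5, 110.8, 125.0])

def aic_weights(aic_values):
    # Subtracting the minimum AIC does not change the weights
    # but improves numerical stability of the exponentials.
    delta = aic_values - aic_values.min()
    w = np.exp(-delta / 2.0)
    return w / w.sum()

w_all = aic_weights(aic)              # MAall-style weights over all models
best3 = np.argsort(aic)[:3]           # indices of the three lowest-AIC models
w_ma3 = aic_weights(aic[best3])       # MA3: renormalize over the best three
```

The averaged curve is then the weighted sum of the individual dose-response curves, e.g. `sum(w * pi(d) for w, pi in zip(w_ma3, best_models))` for MA3.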
Results
The validity and reliability of the simulation results for the 1-aminoanthraquinone dataset, which contained frequent testing at doses with high response rates, are listed in Table 1. BMDL_{10} was 0.92 and BMD_{10} was 7.67 under the selection of the Probit model as the reference model with the lowest AIC value (Fig. 3), and resampling-based simulations were performed. The lowest BMDL or lowest BMD yielded the best validity results, except when exclusion using the KS test and the BMDU/BMDL ratio was applied in advance of model selection. The lowest BMD following model exclusion using both the KS test and the BMD/BMDL ratio yielded the best reliability results. The lowest AIC was among the worst criteria, although about one-third of the simulation results selected by the lowest AIC were produced by the Probit model, i.e. the reference model. Model averaging results yielded intermediate ranks among all model exclusion and selection criteria, and MA3 produced the best reliability and validity results among the model averaging techniques.
Similarly, simulation results for the 2-ethylhexyl vinyl ether dataset, which contained frequent testing at doses with low response rates, are shown in Table 2. BMDL_{10} was 24.69 and BMD_{10} was 28.65 under the selection of the Probit model as the unbiased model (Fig. 3). The validity was highest using the lowest BMDL or the lowest BMD for model selection whenever the same model exclusion was applied in advance. Model averaging, especially MA3, yielded the best reliability performance. Model exclusion did not improve the validity; rather, it decreased the reliability estimates of MAall. No model exclusion criterion changed the calculability of BMDL, and only a small improvement in the reliability of MAall was obtained by model exclusion.
The simulation results for the acrylamide dataset, which contained doses involving both high and low response rates, are shown in Table 3. BMDL_{10} was 0.79 and BMD_{10} was 0.94 under the selection of the logistic model as the unbiased model (Fig. 3). The validity was highest using the lowest BMDL for model selection, and reliability was best when MA3 was used. The logistic model, the unbiased dose-response curve for acrylamide, was selected in about 1 of every 3 simulations. Because the sample size in this dataset was larger than those in the other two datasets, the models converged at a higher frequency than they did in the other simulations and were rarely excluded by the BMD/BMDL or BMDU/BMDL ratio.
Discussion
The BMD method is now widely used to determine the reference dose for toxicological risk assessment of food chemicals, agricultural chemicals, and environmental hazards. However, governmental experts are often puzzled by several ambiguous parts of model assessment, especially the model exclusion and selection processes. As part of the technical assessment for possible improvements in the guidelines, we conducted a simulation-based experiment to assess the model exclusion and selection process by comparing the validity, reliability, and other model performance indicators using all possible combinations of model exclusion and selection criteria. For this assessment, we examined three different empirical datasets, each with different characteristics of the dose-response pattern (i.e. the datasets had rich information about high or low response rates, or approximately linear dose-response patterns). By computationally replicating 1000 sets of hypothetical experimental data in a random manner, we found that the best criteria of model exclusion and selection were different across the chemical substances in each dataset. Further, the best criteria for achieving good validity were not necessarily the best for ensuring good reliability. For instance, the lowest BMDL outperformed the other criteria in achieving high validity, but did not always yield the best reliability. The use of the lowest AIC yielded the best reliability result for the acrylamide dataset, but the worst reliability for the 1-aminoanthraquinone dataset. Moreover, the model averaging results always ranked at an intermediate level among all possible criteria, and did not yield the worst results.
There are two take-home messages. First, although we did not identify a single best exclusion and selection criterion for the qualitatively differently distributed datasets, we have shown that model averaging over the three models with the lowest three AIC values (MA3) did not yield the worst result, and MA3 without prior model exclusion produced the best results among all the model averaging results. For instance, MA3 yielded the best reliability result for the 2-ethylhexyl vinyl ether dataset. If a uniform guideline to implement model exclusion and selection is required, our results indicate that MA3 could become the recommended option whenever applicable. Second, we found that model exclusion using the KS test and the ratios of BMD or BMDU to BMDL did not necessarily yield better validity and reliability than non-exclusion. In particular, both the validity and reliability for the 1-aminoanthraquinone dataset were made worse by imposing exclusion. For example, applying the exclusion criteria of the KS test and the BMDU/BMDL ratio increased the reliability measure (mean distance) of the Lowest BMDL from 0.4 to 27.2 compared with non-exclusion (Table 1). In contrast, the validity (rate of "successful" calculation) of MAall decreased from 98.8 without exclusion to 90.2 when the KS test and the BMDU/BMDL ratio were applied (Table 1). Thus, to decide about model exclusion, visual assessment might be enough to proceed to model selection.
Model averaging has previously been demonstrated as a useful option when determining the point of departure [25], especially for datasets that do not necessarily exhibit a sigmoidal dose-response curve. We found that all the model averaging options that we tested performed well overall. However, how the distance metric (e.g. AIC) across different models can be accounted for, and how the model uncertainty of each parametric assumption can be quantified in the process of averaging, still need to be considered. Considering that at least nine models are fitted to the same dataset, and that some of the models share similar properties while others do not, which models should be averaged needs to be considered, e.g. averaging over all models or only some of them. We found that averaging over some of the models might yield better performance than averaging over all converged models, considering that the uncertainties of well-fitted models might be far smaller than those of badly fitted models. Averaging over the three best models is still a subject of debate (e.g., averaging over the two best models rather than three), and the numbers might change depending on the total number of models to be tested (e.g. more than nine models could be tested) [25]. However, considering that averaging over the three best models outperformed averaging over all the models with close AIC values, reliance only on the penalized likelihood during the averaging might not be a good option. For now, MA3 is the method that we recommend, and we plan to share the programming code and a package for this procedure in the future.
It must be noted that the recommended option does not work when the total number of converged models is one or two; indeed, the convergence of one model alone can occur occasionally. In such an instance, other criteria, including using the modeling result that yields the lowest BMDL or the model with the lowest AIC value, need to be considered. What the present study has shown is that neither the lowest BMDL nor the lowest AIC acted as the unique best method for model selection, whereas the lowest BMDL method can ensure good validity, which is understandable from the conservative nature of this method. It should be noted that the use of the lowest AIC ranked among the worst results for two of the datasets (the exception was the acrylamide dataset) in terms of validity and reliability.
Five technical limitations should be considered. First, we examined only three different chemical substances as sources of information and addressed only qualitative differences among the three datasets. More datasets may have revealed additional insights into ranking the model selection criteria. Second, if a specific dataset behaved uniquely, there should be a corresponding unique criterion that is best suited to its analysis. However, our objective was to identify acceptable model selection criteria across qualitatively different datasets (which showed that MA3 was acceptable overall), and we were not able to classify dose-response curves into several different groups for better fitting. Third, we used only computer simulations. The model used as the reference prior to simulations might have been preferentially recovered during the estimation process. Although we accounted for this bias in Tables 1, 2 and 3, its impact on our examined criteria is not known. Fourth, we did not explore parameter constraints in this study. Fifth, we did not examine other percentile cutoff levels, i.e. the benchmark response was fixed at 10%.
While numerous technical issues have yet to be explored in applying BMD methods to risk assessment, we concluded that MA3 can be considered the best guiding option to derive the reference dose when the guidelines are expected to specify a single model exclusion and selection method.
Conclusion
As part of the technical assessment for possible improvements in the guidelines, we conducted a simulationbased experiment to assess the model exclusion and selection process by comparing the validity, reliability, and other model performance indicators using all possible combinations of model exclusion and selection criteria. If a uniform methodological suggestion for the guideline is required to choose the best performing model for exclusion and selection, our results indicate that using MA3 is the recommended option whenever applicable.
Abbreviations
AIC: Akaike Information Criteria
BMD: Benchmark dose
BMDL: Benchmark dose lower bound
BMDU: Benchmark dose upper bound
CI: Confidence interval
MA: Model averaging
MA3: Model averaging over the three models with the lowest three AIC values
MAAIC: Model averaging over models with an AIC within 3 of the best-fit model with the lowest AIC value
MAall: Model averaging over all converged models
NOAEL: No observable adverse effect level
References
1. Barnes DG, Dourson M. Reference dose (RfD): description and use in health risk assessments. Regul Toxicol Pharmacol. 1988;8(4):471–86.
2. Brown KG, Erdreich LS. Statistical uncertainty in the no-observed-adverse-effect level. Fundam Appl Toxicol. 1989;13:235–44.
3. Leisenring W, Ryan L. Statistical properties of the NOAEL. Regul Toxicol Pharmacol. 1992;15:161–71.
4. Crump K. A new method for determining allowable daily intakes. Fundam Appl Toxicol. 1984;4(5):854–71.
5. Bi J. Using the benchmark dose (BMD) methodology to determine an appropriate reduction of certain ingredients in food products. J Food Sci. 2010;75(1):R9–R16.
6. Kimmel CA, Gaylor DW. Issues in qualitative and quantitative risk analysis for developmental toxicology. Risk Anal. 1988;8(1):15–20.
7. Allen BC, Kavlock RJ, Kimmel CA, Faustman EM. Dose-response assessment for developmental toxicity II. Comparison of generic benchmark dose estimates with no observed adverse effect levels. Fundam Appl Toxicol. 1994;23(4):487–95.
8. Wignall JA, Shapiro AJ, Wright FA, Woodruff TJ, Chiu WA, Guyton KZ, et al. Standardizing benchmark dose calculations to improve science-based decisions in human health assessments. Environ Health Perspect. 2014;122(5):499–504.
9. United States Environmental Protection Agency. Benchmark dose technical guidance. Washington, DC: United States Environmental Protection Agency; 2012. p. 29–31.
10. Sand S, Falk FA, Victorin K. Evaluation of the benchmark dose method for dichotomous data: model dependence and model selection. Regul Toxicol Pharmacol. 2002;36:184–97.
11. Weterings PJJM, Loftus C, Lewandowski TA. Derivation of the critical effect size/benchmark response for the dose-response analysis of the uptake of radioactive iodine in the human thyroid. Toxicol Lett. 2016;22:38–43.
12. Allen BC, Kavlock RJ, Kimmel CA, Faustman EM. Dose-response assessment for developmental toxicity III. Statistical models. Fundam Appl Toxicol. 1994;23:496–509.
13. Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Stat Sci. 1999;14:382–401.
14. Shao K, Gift JS. Model uncertainty and Bayesian model averaged benchmark dose estimation for continuous data. Risk Anal. 2014;34(1):101–20.
15. IPCS. Principles for modelling dose-response for the risk assessment of chemicals. Geneva: World Health Organization; 2009. p. 44–5.
16. Burnham KP, Anderson DR. Multimodel inference: understanding AIC and BIC in model selection. Sociol Methods Res. 2004;33(2):261–304.
17. Hjort NL, Claeskens G. Frequentist model average estimators. J Am Stat Assoc. 2003;98(464):879–99.
18. Sand S, Victorin K, Filipsson AF. The current state of knowledge on the use of the benchmark dose concept in risk assessment. J Appl Toxicol. 2008;28:405–21.
19. European Food Safety Authority. Guidance of the scientific committee on a request from EFSA on the use of the benchmark dose approach in risk assessment. EFSA J. 2009;1150:1–72.
20. Akaike H. A Bayesian analysis of the minimum AIC procedure. Ann Inst Stat Math. 1978;30(1):9–14.
21. Kang SH, Kodell RL, Chen JJ. Incorporating model uncertainties along with data uncertainties in microbial risk assessment. Regul Toxicol Pharmacol. 2000;32(1):68–72.
22. Wheeler MW, Bailer AJ. Model averaging software for dichotomous dose response risk estimation. J Stat Softw. 2008;26(5):1–15. https://doi.org/10.18637/jss.v026.i05.
23. Fletcher D, Turek D. Model-averaged profile likelihood intervals. J Agric Biol Environ Stat. 2012;17(1):38–51.
24. European Food Safety Authority. Update: use of the benchmark dose approach in risk assessment. EFSA J. 2017;15(1):4658.
25. Wheeler MW, Bailer AJ. Properties of model-averaged BMDLs: a study of model averaging in dichotomous risk estimation. Risk Anal. 2007;27(3):659–70.
26. Wheeler MW, Bailer AJ. An empirical comparison of low-dose extrapolation from points of departure (PoD) compared to extrapolations based upon methods that account for model uncertainty. Regul Toxicol Pharmacol. 2013;67(1):75–82.
27. Wheeler MW, Bailer AJ. Comparing model averaging with other model selection strategies for benchmark dose estimation. Environ Ecol Stat. 2009;16(1):37–51.
28. Dose-response data on 1-aminoanthraquinone. http://dra4.nihs.go.jp/BMD/RawData/BMD_82451_kidney_eosinophilic_droplet_in_proximal_tubular_epithelium.txt. (Accessed 18 Dec 2019).
29. Dose-response data on 2-ethylhexyl vinyl ether. http://dra4.nihs.go.jp/BMD/RawData/BMD_103446_liver_centrilobular_hypertrophy_m.txt.com. (Accessed 18 Dec 2019).
30. National Toxicology Program. NTP technical report on the toxicology and carcinogenesis studies of acrylamide (CAS No. 79-06-1) in F344/N rats and B6C3F1 mice (feed and drinking water studies). Durham: National Toxicology Program; 2012. p. 69–70.
 31.
Massey F. The KolmogorovSmirnov test for goodness of fit. JASA. 1951;46(253):68–78.
 32.
Justel A, Peña D, Zamar R. A multivariate KolmogorovSmirnov test of goodness of fit. Stat Prob Letters. 1997;35:251–9.
 33.
David FN, Johnson NL. The probability integral transformation when parameters are estimated from the sample. Biometrika. 1948;35(1/2):182–90.
 34.
Muri SD, Schlatter JR, Brüschweiler BJ. The benchmark dose approach in food risk assessment: is it applicable and worthwhile? Food Chem Toxicol. 2009;47(12):2906–25.
 35.
Moffat I, Chepelev NL, Labib S, BourdonLacombe J, Kuo B, Buick JK, et al. Comparison of toxicogenomics and traditional approaches to inform mode of action and points of departure in human health risk assessment of benzo [a] pyrene in drinking water. Crit Rev Toxicol. 2015;45:1–43.
 36.
Matsumoto M, HirataKoizumi M, Kawamura T, Sakuratan S, Ono A, Hirose A. Validation of the statistical parameters and model selection criteria of the benchmark dose methods for the evaluation of various endpoints in repeateddose toxicity studies. Fundament Toxicol Sci. 2019;6(4):125–36.
 37.
Shao K, Small MJ. Potential uncertainty reduction in modelaveraged benchmark dose estimates informed by an additional dose study. Risk Anal. 2011;31(10):1561–75.
 38.
Piegorsch WW, An L, Wickens AA, West RW, Peña EA, Wu W. Informationtheoretic modelaveraged benchmark dose analysis in environmental risk assessment. Environmetrics. 2013;24:143–57.
 39.
Bailer AJ, Noble RB, Wheeler MW. Model uncertainty and risk estimation for experimental studies of quantal responses. Risk Anal. 2005;25(2):291–9.
 40.
Morales KH, Ibrahim JG, Chen CJ, Ryan LM. Bayesian model averaging with applications to benchmark dose estimation for arsenic in drinking water. JASA. 2006;p101(473):9–17.
Acknowledgments
We thank Margaret Biswas, PhD, from Edanz Group (www.edanzediting.com/ac) for editing a draft of this manuscript.
Funding
This study was commissioned under a grant for the 2018 Cabinet Office Research for Assessment of the Effect of Food on Human Health, Japan (ID: 1801, PI: Akihiko Hirose). This work was also in part supported by grants from the 2019 Cabinet Office Research for Assessment of the Effect of Food on Human Health, Japan (ID: 1907, PI: Hiroshi Nishiura), the Japan Agency for Medical Research and Development (JP18fk0108050), the Japan Science and Technology Agency (JST) CREST program (JPMJCR1413), the Smoking Research Foundation, and the Japan Society for the Promotion of Science (JSPS) KAKENHI (17H04701, 17H05808, 18H04895, and 19H01074). The funding bodies had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Contributions
HN conceived the study. HN and AH conceptualized the study design, KI and AH collected the empirical data, HN formulated the analytical flow of simulations, KY and YT performed the simulations. KY and HN drafted an early version of the manuscript. All authors gave comments on the revised manuscript and approved the final version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
The present study examined publicly available experimental data and was primarily a computer-based simulation study. As such, the datasets employed in this study did not require ethical approval.
Consent for publication
Not applicable.
Competing interests
The authors declare that co-author H. Nishiura is the Editor-in-Chief of Theoretical Biology and Medical Modelling. This does not alter the authors’ adherence to all the journal’s policies on sharing data and materials. The other authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Yoshii, K., Nishiura, H., Inoue, K. et al. Simulation-based assessment of model selection criteria during the application of benchmark dose method to quantal response data. Theor Biol Med Model 17, 13 (2020). https://doi.org/10.1186/s12976-020-00131-w
Keywords
 Risk assessment
 Dose-response curve
 Toxicology
 Simulation
 Model averaging
 Benchmark dose