# Heterogeneity in multistage carcinogenesis and mixture modeling

- Sandro Gsteiger
^{1}Email author and - Stephan Morgenthaler
^{1}

**5**:13

https://doi.org/10.1186/1742-4682-5-13

© Gsteiger and Morgenthaler; licensee BioMed Central Ltd. 2008

**Received: **26 October 2006

**Accepted: **21 July 2008

**Published: **21 July 2008

## Abstract

Carcinogenesis is commonly described as a multistage process, in which stem cells are transformed into cancer cells via a series of mutations. In this article, we consider extensions of the multistage carcinogenesis model by mixture modeling. This approach allows us to describe population heterogeneity in a biologically meaningful way. We focus on finite mixture models, for which we prove identifiability. These models are applied to human lung cancer data from several birth cohorts. Maximum likelihood estimation does not perform well in this application due to the heavy censoring in our data. We thus use analytic graduation instead. Very good fits are achieved for models that combine a small high risk group with a large group that is quasi immune.

## Keywords

## Introduction

Cancers can arise in virtually any part of the body, and although there are many tissue specific properties, a general multistage framework for carcinogenesis holds for most cancer types. More precisely, cells must undergo an evolutionary process involving several stages and leading finally to a cell that has completely lost proliferation control. In a first step, called initiation, mutations transform stem cells into intermediate states. Such initiated cells may give rise to pre-neoplastic lesions via accelerated growth. Eventually, a cell out of such a clone may experience further mutations and be transformed into a malignant tumor cell. This second step comprising clonal expansion and final malignant transformation is commonly called promotion. This multistage scheme shows the inherently random aspect of carcinogenesis: mutations happen at random times and stochastic growth processes are involved.

Mathematical models of carcinogenesis have been studied for about fifty years. Some of the earliest attempts to build biologically based quantitative descriptions are [1] and [2], who explained cancer as the result of a sequence of mutations. A widely accepted model was proposed in [3] and [4]. Their two-stage clonal expansion (TSCE) model was explicitly formulated in terms of an initiation stage and a promotion stage. This approach stressed the importance of both mutations and clonal expansion in the process leading to cancer. The TSCE model has found many applications and extensions. One example is the multistage model, which takes up the same structure but allows for more than two stages. Due to this long and evolving story, we should not have in mind a single model when talking about the multistage model. We should rather have in mind a cascade of nested models that starts from a fundamental idea and incorporates through its evolution more and more biological detail. Excellent reviews of stochastic carcinogenesis modeling can be found in [5] and [6].

One part of the recent extensions tries to take population heterogeneity into account. Such heterogeneity can result from sources such as genetic variation, exposure to carcinogens due to either changes in environment or occupation, and differences in lifestyle (the most prominent factors being smoking and diet). In [7] a mixture of a one stage model and a two stage model was used to describe the heritable and the sporadic form of Retinoblastoma, a cancer of the eye caused by mutations in a single tumor suppressor gene. Other approaches incorporate heterogeneity via standard frailty modeling, where the common baseline hazard *h*_{0}(*t*) is multiplied by a non-negative random variable *Z* in order to model the individual hazard *h*_{
ind
}(*t*) = *Zh*_{0}(*t*), see for example [8], and [9] for such an approach.

In this text, we take up the work by [10]. These authors introduce two new population parameters to describe heterogeneity. The first one, called the fraction at risk *F*, is used to distinguish between susceptibles and a postulated group of immune individuals. The second one, called the fraction of deaths due to cancer among all deaths due to either cancer or related competing causes *f*, models competing related risks. They fit their model to US lung cancer incidence data from several birth cohorts. The parameters *F* and *f* present an abstract way describe population heterogeneity and are not linked to a specific biological process. Therefore, the above mentioned authors state that other modeling strategies could be tested. The present work gives such an attempt. We take up the same multistage model, but we will use mixture models to allow for variability among individuals. This allows us to introduce heterogeneity in a biologically meaningful way.

In the next section, we describe the multistage carcinogenesis model and introduce an extension by mixture. Then we will give a series of identifiability results for both the multistage model and some mixture models. Finally, we apply the model to human incidence data before giving some concluding remarks.

## Mathematical Model Formulation

### The Multistage Carcinogenesis Model

We will work with a simplified version of the multistage model, but one that is general enough to incorporate the two main features of the carcinogenesis process: the sequence of mutations and the clonal expansion. We make the following assumptions:

1. A cell must undergo *n* mutational events to get initiated.

2. The number of cells at risk, *N*_{0}, is constant over time.

3. The number of newly generated initiated cells is a (non-homogeneous) Poisson process with intensity *λ*_{
I
}(*t*).

4. An initiated cell gives rise to a clonal expansion according to a birth-and-death process with emigration, i.e. in a short time interval (*t*, *t* + Δ*t*) an initiated cell divides in two initiated cells with rate *β*, dies or differentiates with rate *δ*(<*β*), and divides into one initiated and one malignant cell with rate *μ*.

5. Once a promoted cell is generated, its growth is deterministic, and we neglect the time needed to grow to detectable tumor size.

6. The system starts with all at risk cells in the normal state and the different cells act independently of one another.

*N*

_{0}) has been discussed by several authors. However, we choose this simplified version in order to limit the complexity of our baseline model. Note also that for

*n*= 1 we get the classical TSCE model.

*F*

_{ P }(

*x*) is the cdf for the waiting time for the first malignant transformation within a clone starting with one initiated cell at time 0. This cdf is improper since a clone of initiated cells dies out with a probability greater than 0 if

*δ*> 0. Its exact form is

where *θ* = *β* - *δ* - *μ* and $\Delta =\sqrt{{(\beta +\delta +\mu )}^{2}-4\beta \delta}$.

Researchers have expressed concern about the approximations used in carcinogenesis models. This issue was raised in a review paper by [13] and has inspired [11] and [14]. Our formula above is exact and uses the method based on integration cited by [[11], top of p. 1080]. The remaining simplifications in our model, in particular the constancy of *N*, are for convenience and do not affect the conclusions of the paper. As a general comment, it should be noted that the term two-stage refers to different things in different papers. It is, for example, possible to model clones via compartments or via branching processes. Both may employ the same parameter notation, but the interpretation will be quite different. Care has to be taken, if one wishes to stay close to biological reality. For further comments, see [[13], section 2.1].

*h*(

*t*;

*n*) =

*-*d log

*S*(

*t*;

*n*)/d

*t*. In order to deduce the asymptotic behavior of the hazard for several

*n*, we note that

*h*(

*t*;

*n*) can be written in terms of a recursion,

*h*(

*t*; 1) =

*νN*

_{0}

*F*

_{ P }(

*t*) and

*h*(

*t*;

*n*) = $n\nu {\displaystyle {\int}_{0}^{t}h}(u;n-1)\text{d}u$ for

*n*≥ 2. Therefore, when

*n*= 1 the hazard levels off as

*t*goes to infinity. More precisely, the hazard of the TSCE model goes to the finite asymptote

But the hazard grows to infinity if *n* ≥ 2. In both cases *h*(*t*; *n*) is strictly monotonic increasing with *t*. The unboundedness of the hazard is due to the simplifications in the model. This may lead to appreciable differences at low ages in some types of cancer or under hightened exposure. It is typically of lesser importance in human studies.

The monotonicity properties of hazard curves are not in agreement with observed incidence curves from human population data. Such data typically shows very low incidence up to about the age of fifty, a sudden and sharp increase between about the ages of fifty and eighty, and a subsequent leveling off and a decrease for the very old. This behavior at old ages is not captured by the hazard curves *h*(*t*; *n*). However, it can be modeled very easily by incorporating a frailty effect as we will show in the application later on.

### Extension of the Model by the Use of Mixture Distributions

An observed human population is heterogeneous. Though the process of cancer development is similar for everyone, parameters may vary between individuals. Since all parameters of the model are biologically meaningful, we aim at modelling heterogeneity directly through these parameters. We thus propose to consider some of the biological parameters – one at a time – as random variables.

*θ*be such a parameter and let

*G*(

*θ*) be a distribution function for

*θ*. Then, we will denote by

*S*(

*t*|

*θ*) the survivor function of the multistage model (1) for a given value

*θ*, whereas the population survivor function is

The distribution function *G* must then be selected based on the biological parameter *θ* chosen. If we consider for example *θ* = *n*, the number of mutations needed for initiation, it is natural to choose a finite distribution, i.e. P(*θ* = n_{i}) = *π*_{i} for a fixed set {*n*_{1},...,*n*_{
g
}} ⊂ ℕ such that $\sum {\pi}_{i}}=1$. This would correspond to *g* population subgroups having inherited different numbers of initiating mutations. The model could also be interpreted as a multiple pathway model, where *g* pathways involving different numbers of mutations can lead to cancer. Other interesting choices focus on promotion, for example *θ* = *β* - *δ*, the growth advantage of initiated cells, or *θ* = *μ*, the promotion rate. These two cases would be consistent with both a finite or a continuous distribution as long as its support is contained within biologically reasonable bounds.

## Identifiability

Before fitting our mixture model to observed incidence data, the identifyability issue has to be considered. The parameters of the TSCE model cannot be uniquely determined based on incidence data. Some of the papers relevant to this issue are [15, 16], and [17].

### Identifiability of the Multistage Model

*n*, the number of initiating mutations, and

*ψ*= (

*N*

_{0},

*ν*,

*β*,

*δ*,

*μ*), sizes and rates. When fitting

*S*(

*t*|

*n*,

*ψ*), the survivor function of the multistage model given in (1), these six parameters cannot all be fitted separately. We will show that

*n*and the follwing three combinations are, however, uniquely determined

In order to deal with the discrepancy between the six parameters and the four identifyable features, we will hold the two parameters *N*_{0} and *δ* fixed (see also [18]). To determine *N*_{0} = *Number of stem cells in a given tissue* some preliminary biological estimate is needed. For the death rate we mainly focus on the choice *δ* = 0, which implies *γ* = *β* - *δ* = *β*, so that *β* simply describes the growth advantage of initiated cells.

#### The Number of Mutations for Initiation

Is *n* identifiable in the multistage model or could a change in *n* be compensated by some adjustment of the biological parameters *ψ* so that finally the same survivor function resulted? The answer to this question is no, because the behavior of *S*(*t*|*n*, *ψ*) at the origin is enough to determine *n*.

**Proposition 1.** *If for two parameter choices* (*n*, *ψ*), *and* ($\stackrel{~}{n}$, $\stackrel{~}{\psi}$) *we have S*(*t*|*n*, *ψ*) ≡_{
t
}*S*(*t*|$\stackrel{~}{n}$, $\stackrel{~}{\psi}$), *then n* = $\stackrel{~}{n}$.

*Proof:*Direct calculation shows that

for all *n* ∈ {1, 2,...}, where *S*^{(k)}= *d*^{
k
}*S*/*dt*^{
k
}. The proposition is a direct consequence.

This result is mainly of theoretical interest and cannot be used to estimate *n*. In practice, for some tissues one can fix *n* according to the available biological theory. For example in colon cancer it is commonly assumed that two mutations are necessary for initiation, followed by a third one for malignant transformation (see [19]). In cases where no biological reasoning is available, we suggest to fit the model for several choices of *n*. The form of the intensity of initiation given in expression (2) shows that estimates for *ν* are highly sensitive to the choice of *n*. Results within biologically reasonable limits will thus be obtained only for very few values *n*.

#### Growth and Mutation Rates

For *n* = 1 it has been shown in [16] that three functions of *ψ* are uniquely determined by *S*(*t*|*n* = 1, *ψ*). Their proof can be generalized to *n* ≥ 1. This is intuitively plausible. Given *n*, the intensity of initiation depends on the product *N*_{0}*ν*^{
n
}, but not on *N*_{0} and *ν* individually. And the speed at which a clone of initiated cells grows depends only on the difference *β* - *δ*, but not on the actual pair *β*, *δ*.

**Lemma 2.**

*Let*(

*n*,

*ψ*)

*and*($\stackrel{~}{n}$, $\stackrel{~}{\psi}$)

*be two sets of parameters such that S*(

*t*|

*n*,

*ψ*) ≡

_{ t }

*S*(

*t*|$\stackrel{~}{n}$, $\stackrel{~}{\psi}$).

*Then we have*

*Proof:*Let us define the integral

*I*(

*t*;

*n*,

*ψ*) = ${\int}_{0}^{t}{\lambda}_{I}}(t-x){F}_{P}(x)\text{dx$. This means that

*S*(

*t*|

*n*,

*ψ*) = exp{-

*I*(

*t*;

*n*,

*ψ*)}. Note that by Proposition 1 we have

*n*= $\stackrel{~}{n}$. So we must show that if

*I*(

*t*;

*n*,

*ψ*) via (

*n*- 1) repeated integrations by parts into a

*n*-fold integral. Next, we differentiate this expression

*n*times with respect to

*t*, to obtain

Application of these two steps to both sides of (5) proves the result.

### Identifiability of the Mixture Structure

*θ*. Then, $\mathcal{G}$ induces the family of mixture models

holds for all *G*_{1}, *G*_{2} ∈ $\mathcal{G}$. In other words, the population survivor function must uniquely determine the underlying mixing distribution within a pre-specified family.

*θ*

_{1},

*θ*

_{2},...} such that

*θ*

_{1}<

*θ*

_{2}< ... . Then the finite mixture model $S(t)={\displaystyle {\sum}_{i=1}^{g}{\pi}_{i}}S(t|{\theta}_{i})$ is identifiable if

This condition ensures identifiability of all finite mixtures of the survivor functions {*S*(*t*|*θ*_{
i
}); *i* = 1, 2,...} even without specifying the number of components *g*. Teicher's result requires additional regularity conditions, but these are trivially satisfied in the case of the multistage model (1).

#### Initiation

The next proposition shows that such a mixture is identifiable.

**Proposition 3.** *The family of finite mixture models induced by* {*S*(*t*|*n*, *ψ*); *n* = 1, 2,...} *is identifiable*.

*Proof:*We show that condition (6) is satisfied. The initiation incidence rates can be written recursively

*S*(

*t*|

*n*,

*ψ*) → 0 for

*t*→ ∞, we have Λ (

*t*) → ∞ for

*t*→ ∞. This implies

#### Promotion

Promotion is a complicated process and both genetic and epigenetic factors seem to be involved. Therefore, heterogeneity can be due to many different mechanisms. In the context of the multistage model, there are two main parameters these agents can influence: the growth advantage of initiated cells, *γ*, and the rate of malignant transformation, *μ*.

We can derive a result similar to the one in the previous section. Let there be a discrete set of *γ*-values 0 <*γ*_{1} <*γ*_{2} < ... . Note that we consider *δ* = *γ*_{
i
}- *β*_{
i
}as fixed, i.e. we assume in fact that there is an analogous sequence of *β*_{
i
}. From now on, we will write *ψ* for the parameter vector (*n*, *N*_{0}, *ν*, *δ*, *μ*).

**Proposition 4.** *The family of finite mixture models induced by* {*S*(*t*|*γ*_{
i
}, *ψ*); *i* = 1, 2,...} *is identifiable*.

*Proof:*We will first check condition (6) in the case

*δ*> 0. We have

*δ*> 0 implies that

*F*

_{ P }is improper and converges to a limit

*a*(

*γ*,

*δ*) < 1 as t → ∞. The value 1 -

*a*(

*γ*,

*δ*) is the probability that a clone of initiated cells (generated by a single initiated cell at time

*t*= 0) eventually dies out without ever giving rise to a promoted cell. The assumption

*γ*

_{i+1}>

*γ*

_{ i }implies that

*a*(

*γ*

_{i+1},

*δ*) >

*a*(

*γ*

_{ i },

*δ*), and as a consequence

*δ*= 0, and thus

*γ*

_{ i }=

*β*

_{ i }. The function

*F*

_{ P }is in this case equal to

*β*

_{ i }and

*β*

_{i+1}. A direct calculation shows that

- 1.
$\frac{\partial}{\partial \beta}{F}_{P}(0|\stackrel{~}{\beta})=0$ = 0,

- 2.
$\frac{\partial}{\partial \beta}{F}_{P}(x|\stackrel{~}{\beta})$ is non-negative for all

*x*≥ 0, and - 3.
$\frac{\partial}{\partial \beta}{F}_{P}(x|\stackrel{~}{\beta})$ asymptotically goes to 0 as

*x*→ ∞.

*t*

_{0}be the (unique) maximum of $\frac{\partial}{\partial \beta}{F}_{P}(x|\stackrel{~}{\beta})$. It follows that

which completes the proof.

The same idea could be applied to parameter *μ*. Though the biological interpretation of such a frailty model would be different, technically no new issues arise, and similar results can be established.

## Fitting Mixture Models

where *r*_{
i
}counts the population at risk and *o*_{
i
}counts the observed cancer cases during the time interval [*t*_{
i
}, *t*_{i+1}). The data is discussed in [21] and is publically available ([22]). Additional information is given in [23].

As mentioned earlier, the observed hazard has a peak at around 80 years and a decrease for the higher ages, while the hazard of the multistage model given by (1) is strictly monotonic increasing. Estimation of the parameters by analytic graduation as described later on leads to extremely poor fits, which for all ages above 30 and for all birth cohorts give quite useless predictions. The fault does, however, not lie with the methods of estimation but rather with the model. Thus, using the inverse of the variance of the estimated incidence rates as weights leads to almost the same poor fit. The unmixed multistage model does not succeed in describing the incidence rates in Figures 2 and 3. We will come back to this failure.

*γ*-frailty model

*π*

_{ l }≤ 1,

*π*

_{ l }+

*π*

_{ u }= 1, 0 <

*γ*

_{ l }<

*γ*

_{ u }. To get identifiability, we will fix

*N*

_{0}and

*δ*. But in order to get stable estimates, we fix also

*γ*

_{ l }and

*n*. Note that the parameters we estimate have a restricted domain of definition,

We will use suitable transformations to respect these constraints.

### Maximum Likelihood Estimation

*t*

_{ i },

*t*

_{i+1}) we observe

*o*

_{ i }failures due to cancer and have

*c*

_{ i }=

*r*

_{ i }-

*r*

_{i+1}-

*o*

_{ i }censored individuals. Under the assumption of independent and uninformative censoring, the likelihood function

*L*(

*π*

_{ l },

*ν*,

*γ*

_{ u },

*μ*|

*N*

_{0},

*δ*,

*n*,

*γ*

_{ l }) is given by

*ψ*= (logit

*π*

_{ l }, log

_{10}

*μ*). The log-likelihood essentially has a ridge starting the upper-right corner and running downwards as one moves to the left. This means that only a combination of the two parameters can be estimated precisely, but not both separately. The log-likelihood values of the estimates in Figure 5 are

*l*(${\stackrel{\u02c6}{\psi}}_{\text{ML}}$) = -1.338·10

^{6}and

*l*(${\stackrel{\u02c6}{\psi}}_{\text{LS}}$) = -1.355·10

^{6}. While these values appear to be close, they are in fact quite different in the likelihood metric, because

*r*

_{ i },

*o*

_{ i }) the real data set. Let us define the points (${\stackrel{~}{r}}_{i}^{k}$, ${\stackrel{~}{o}}_{i}^{k}$) by

^{6}and suppose that during every time interval exactly

*k*10

^{4}individuals die – either due to cancer or due to competing causes. We then fit model (7) by maximum likelihood as before (we consider again the four parameters

*π*

_{ l },

*ν*,

*γ*

_{ u },

*μ*as unknown). Figure 7 gives the estimated models for

*k*= 1,...,8. The MLE behaves better for small

*k*than for large

*k*. In Figure 8 we calculated the residual sums of squares (RSS) for these models, which seem to increase exponentially with the coefficient of variation of the ${\stackrel{~}{r}}_{i}^{k}$ sequence,

The above example shows that the MLE is dominated by the points corresponding to large "at risk" sets. The LSE, on the other hand, works fine, since it attributes equal weight to all age intervals. This makes one wonder whether a weighted least squares approach would suffer from the same problem as the MLE. If we give for example weights proportional to the population at risk, would the LSE break down as well? The answer to this question is clearly no. Considering Figure 4 once again, we realize that there is a model that fits all the data points very accurately. This model will be good even if we downweight the contribution to the RSS of the points at high ages. Any weighted least squares approach will select a model that is very close to the standard LSE.

### Analytic Graduation

The LS estimates shown in the previous figures were obtained by analytic graduation, which is a standard procedure to fit continuous curves to discretized data. A detailed discussion of the procedure and derivation of asymptotic results can be found in [24, 25].

*N*

_{0},

*δ*,

*γ*

_{ l }and

*n*. The value

*N*

_{0}acts as a scale parameter. Changes in

*N*

_{0}are compensated by

*ν*such that the product

*N*

_{0}

*ν*

^{ n }stays more or less constant. The other parameters also remain quite stable. The effect of

*δ*is rather fuzzy, no clear conclusions emerge. In all cases where enough data at old ages is available (i.e. all but the 1920s cohort), the estimated proportion of the population at high risk, ${\stackrel{\u02c6}{\pi}}_{u}$, is not sensitive to changes in the fixed parameter values. The peak of the observed hazard determines ${\stackrel{\u02c6}{\pi}}_{u}$ quite accurately. Finally, reasonable results can also be obtained for

*n*= 3, while other choices of

*n*produce unrealistic estimates for at least some of the parameters. Note that good fits can be achieved only as long as

*γ*

_{ l }is small enough. We set

*γ*

_{ l }= 10

^{-4}in the models given here.

Conditional parameter estimates given the fixed values *n* = 2, *N*_{0} = 10^{10}, *δ* = 0 and *γ*_{
l
}= 10^{-4}.

Cohort | Males | Females | ||||||
---|---|---|---|---|---|---|---|---|

${\stackrel{\u02c6}{\pi}}_{u}$ | $\stackrel{\u02c6}{\nu}$ | ${\stackrel{\u02c6}{\gamma}}_{u}$ | $\stackrel{\u02c6}{\mu}$ | ${\stackrel{\u02c6}{\pi}}_{u}$ | $\stackrel{\u02c6}{\nu}$ | ${\stackrel{\u02c6}{\gamma}}_{u}$ | $\stackrel{\u02c6}{\mu}$ | |

1880s | 0.021 | 2.5 × 10 | 0.183 | 3.5 × 10 | 0.003 | 4.9 × 10 | 0.134 | 2.2 × 10 |

1890s | 0.029 | 3.2 × 10 | 0.173 | 4.4 × 10 | 0.005 | 4.3 × 10 | 0.146 | 2.1 × 10 |

1900s | 0.034 | 3.6 × 10 | 0.167 | 5.2 × 10 | 0.007 | 4.8 × 10 | 0.168 | 1.3 × 10 |

1920s | 0.077 | 1.9 × 10 | 0.189 | 7.5 × 10 | 0.023 | 2.4 × 10 | 0.203 | 3.2 × 10 |

The least squares estimates for a single unmixed multistage model in which all the parameters except *N* and *δ* are fitted, lead to curves ressembling the maximum likelihood fit in Figure 3. The estimated parameters show that the simple multistage model attempts to distinguish between the earlier and later cohorts to a large extent by increasing the initiation rate *ν*. The increase is three-fold between 1880 and 1920 for the females and even six-fold for the males. The other parameters remain more or less constant across the cohorts. The incidence rates in females is much lower than in males. While the mixture model adjusts to this through an adjustment of the mixture weights, the single multistage model explains it by very different estimates of the promotion rate *μ* between the sexes.

*C*⊂ ℝ

^{4}, and then we look at projections of

*C*on the six parameter plains spanned by the four parameter axis. The confidence region we get for the EAMs 1880s cohort is shown in Figure 11. The confidence region reveals the strong dependencies between the different parameters. Though a parametrization with such dependant parameters is unsatisfactory from a mathematical point of view, the dependencies might be interesting in biological terms. Not the two mutation rates

*ν*and

*μ*seem to compete, but rather the net growth rate

*γ*and the two mutation rates. So at the extremes of the confidence region, we have models with high mutation rates but low proliferation of initiated cells, or models with low mutation rates but large cell growth. Note that the corresponding hazard curves are markedly different.

## Discussion

The *γ*-frailty model we fitted to our data is such that only one parameter involved in promotion is allowed to vary between population subgroups. Such a model would suggest a process where initiated cells are created in all individuals according to the same dynamics, but only in a small subgroup of the population promotion and malignant transformation happen. This is consistent with the fact that promotion depends on stimuli that might be present only in a fraction of the population. The proportion of high risk individuals, estimated by ${\stackrel{\u02c6}{\pi}}_{u}$, reflects the change of the hazard curves between the 1880s and the 1920s cohorts. The sharp increase of maximal incidence in a relatively short period of time must be due to environmental factors such as occupational exposure and smoking.

We got satisfactory fits only when we allowed for two clearly separated population subgroups with a low risk group that runs a risk close to zero. This is consistent with the results reported by the other research groups that introduced frailty into carcinogenesis modeling. In [10] the estimated fraction at risk is very low. And also in [8] the estimated proportion of susceptibles is lower than 0.5%, though these authors worked with Scandinavian data on testicular cancer.

Besides the *γ*-frailty model we considered here, one could clearly build mixture models using other parameters. In particular the number of mutations for initiation, *n*, the rate of malignant transformation, *μ*, or the initiating mutation rate *ν* would be natural choices. The increase in lung cancer rates that was observed during the 20^{th}century coincides with an increase in the fraction of smokers. Smoking might very well influence *μ* and *ν*. However, when applying such models to our data, no new issues arise, and we omit a detailed discussion here. Still, one needs to realize that all the mentioned models can fit the data equally well. This is not surprising, since we fit a relatively complex model to a very simple data structure. However, in all approaches we tested, the data suggested two component mixtures with a small high risk group and a large quasi immune group.

## Conclusion

We have studied an extension of the multistage carcinogenesis model by mixture. This allowed us to introduce population heterogeneity. The multistage model is a mechanistic model and all its parameters have a biological interpretation. Therefore, it is natural to introduce the notion of frailty in a biologically meaningful way. Such an approach is given by our mixture models, which can reproduce observed human lung cancer incidence data very accurately. Very good fits are achieved with very simple, two component mixtures also in cases where a continuous distribution might seem more adequate. However, the peak observed in the population hazard rates can be reproduced by continuous mixture models only when the population is clearly separated into a high risk and a low risk group. In other terms, the density of such a distribution would typically be bathtub shaped, closely resembling two component mixtures. Biological systems are often buffered. Small changes in the environment have no significant effect, and only after passing over some threshold value may abrupt changes in the system occur. Since we consider a late end-point, namely cancer, in a very complicated system, it is not surprising that we obtain in Section 4 mixture models that reflect such a buffering. It would be interesting to link the model with concrete biological mechanisms that are able to explain flip-flop processes. This would be an approach to understand how heterogeneity acts upon human carcinogenesis.

## Declarations

### Acknowledgements

We thank the editor and the referees for thoughtful comments and their help in improving the paper. The first author has been supported by a grant from the Swiss National Science Foundation.

## Authors’ Affiliations

## References

- Nordling CO: A new theory on the cancer-inducing mechanism. Brit J Cancer. 1953, 7: 68-72.PubMed CentralView ArticlePubMedGoogle Scholar
- Armitage P, Doll R: The age distribution of cancer and a multi-stage theory of carcinogenesis. Brit J Cancer. 1954, 8: 1-12.PubMed CentralView ArticlePubMedGoogle Scholar
- Moolgavkar SH, Venzon DJ: Two-event models for carcinogenesis: Incidence curves for childhood and adult tumors. Mathematical Biosciences. 1979, 47: 55-77. 10.1016/0025-5564(79)90005-1.View ArticleGoogle Scholar
- Moolgavkar SH, Knudson A: Mutation and cancer: A model for human carcinogenesis. J Nat Cancer Inst. 1981, 66: 1037-1052.PubMedGoogle Scholar
- Kopp-Schneider A: Carcinogenesis models for risk assessment. Statistical Methods in Medical Research. 1997, 6: 317-340. 10.1191/096228097672462000.View ArticlePubMedGoogle Scholar
- Edler L, Kitsos CP, (Eds): Recent Advances in Quantitative Methods in Cancer and Human Health Risk Assessment. 2005, Wiley Series in Probability and Statistics, West Sussex, England: John Wiley & SonsGoogle Scholar
- Tan WY, Singh KP: A mixed model of carcinogenesis with applications to retinoblastoma. Mathematical Biosciences. 1990, 98: 211-225. 10.1016/0025-5564(90)90125-I.View ArticlePubMedGoogle Scholar
- Aalen OO, Tretli S: Analyzing incidence of testis cancer by means of a frailty model. Cancer Causes and Control. 1999, 10: 285-292. 10.1023/A:1008916718152.View ArticlePubMedGoogle Scholar
- Moger TA, Aalen OO, Halvorsen TO, Storm HH, Tretli S: Frailty modelling of testicular cancer incidence using Scandinavian data. Biostatistics. 2004, 5: 1-14. 10.1093/biostatistics/5.1.1.View ArticlePubMedGoogle Scholar
- Morgenthaler S, Herrero P, Thilly WG: Multistage carcinogenesis and the fraction at risk. Journal of Mathematical Biology. 2004, 49 (5): 455-467. 10.1007/s00285-004-0271-9.View ArticlePubMedGoogle Scholar
- Kopp-Schneider A, Portier CJ, Sherman CD: The exact formula for tumor incidence in the two-stage model. Risk Analysis. 1994, 14 (6): 1079-1080. 10.1111/j.1539-6924.1994.tb00078.x.View ArticlePubMedGoogle Scholar
- Zheng Q: On the exact hazard and survival functions of the MVK stochastic carcinogenesis model. Risk Analysis. 1994, 14 (6): 1081-1084. 10.1111/j.1539-6924.1994.tb00079.x.View ArticlePubMedGoogle Scholar
- Moolgavkar SH, Luebeck G: Two-event model for carcinogenesis: Biological, mathematical, and statistical considerations. Risk Analysis. 1990, 10 (2): 323-341. 10.1111/j.1539-6924.1990.tb01053.x.View ArticlePubMedGoogle Scholar
- Hoogenveen R, Harvey J, Andersen M, Slob W: An alternative exact solution of the two-stage clonal growth model of cancer. Risk Analysis. 1999, 19: 9-14.Google Scholar
- Heidenreich WF: On the parameters of the clonal expansion model. Radiat Environ Biophys. 1996, 35: 127-129. 10.1007/BF02434036.View ArticlePubMedGoogle Scholar
- Hanin LG, Yakovlev AY: A nonidentifiability aspect of the two-stage model of carcinogenesis. Risk Analysis. 1996, 16 (5): 711-715. 10.1111/j.1539-6924.1996.tb00819.x.View ArticlePubMedGoogle Scholar
- Heidenreich WF, Luebeck EG, Moolgavkar SH: Some properties of the hazard function of the two-mutation clonal expansion model. Risk Analysis. 1997, 17 (3): 391-399. 10.1111/j.1539-6924.1997.tb00878.x.View ArticlePubMedGoogle Scholar
- Sherman CD, Portier CJ: The two-stage model of carcinogenesis: overcoming the nonidentifiability dilemma. Risk Analysis. 1997, 17 (3): 367-374. 10.1111/j.1539-6924.1997.tb00875.x.View ArticlePubMedGoogle Scholar
- Luebeck EG, Moolgavkar SH: Multistage carcinogenesis and the incidence of colorectal cancer. PNAS. 2002, 99 (23): 15095-15100. 10.1073/pnas.222118199.PubMed CentralView ArticlePubMedGoogle Scholar
- Teicher H: Identifiability of finite mixtures. Annals of Mathematical Statistics. 1963, 34: 1265-1269. 10.1214/aoms/1177703862.View ArticleGoogle Scholar
- Herrero-Jimenez P: Determination of the historical changes in primary and secondary risk factors for cancer using U.S. public health records. PhD thesis. 2001, MITGoogle Scholar
- MIT epidemiology database pages. Http://epidemiology.mit.edu/Http://epidemiology.mit.edu/
- Herrero-Jimenez P, Tomita-Mitchell A, Furth EE, Morgenthaler S, Thilly WG: Population risk and physiological rate parameters for colon cancer: The union of an explicit model for carcinogenesis with the public health records of the United States. Mutation Research – Fundam Molec Mechan Mutagenesis. 2000, 447: 73-116. 10.1016/S0027-5107(99)00201-8.View ArticleGoogle Scholar
- Hoem JM: On the statistical theory of analytic graduation. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Theory of statistics. 1972, Berkeley, Calif.: Univ. California Press, I: 569-600.Google Scholar
- Hoem JM: The statistical theory of demographic rates. A review of current developments. Scand J Statist. 1976, 3 (4): 169-185. [With discussion by Niels Keiding, Hannu Kulokari, Bent Natvig, Ole Barndorff-Nielsen, Jørgen Hilden and a reply by the author].Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.