Theoretical Biology and Medical Modelling Open Access Heterogeneity in Multistage Carcinogenesis and Mixture Modeling

Carcinogenesis is commonly described as a multistage process, in which stem cells are transformed into cancer cells via a series of mutations. In this article, we consider extensions of the multistage carcinogenesis model by mixture modeling. This approach allows us to describe population heterogeneity in a biologically meaningful way. We focus on finite mixture models, for which we prove identifiability. These models are applied to human lung cancer data from several birth cohorts. Maximum likelihood estimation does not perform well in this application due to the heavy censoring in our data. We thus use analytic graduation instead. Very good fits are achieved for models that combine a small high risk group with a large group that is quasi immune.


Introduction
Cancers can arise in virtually any part of the body, and although there are many tissue specific properties, a general multistage framework for carcinogenesis holds for most cancer types. More precisely, cells must undergo an evolutionary process involving several stages and leading finally to a cell that has completely lost proliferation control. In a first step, called initiation, mutations transform stem cells into intermediate states. Such initiated cells may give rise to pre-neoplastic lesions via accelerated growth. Eventually, a cell out of such a clone may experience further mutations and be transformed into a malignant tumor cell. This second step comprising clonal expansion and final malignant transformation is commonly called promotion. This multistage scheme shows the inherently random aspect of carcinogenesis: mutations happen at random times and stochastic growth processes are involved.
Mathematical models of carcinogenesis have been studied for about fifty years. Some of the earliest attempts to build biologically based quantitative descriptions are [1] and [2], who explained cancer as the result of a sequence of mutations. A widely accepted model was proposed in [3] and [4]. Their two-stage clonal expansion (TSCE) model was explicitly formulated in terms of an initiation stage and a promotion stage. This approach stressed the importance of both mutations and clonal expansion in the process leading to cancer. The TSCE model has found many applications and extensions. One example is the multistage model, which takes up the same structure but allows for more than two stages. Due to this long and evolving story, we should not have in mind a single model when talking about the multistage model. We should rather have in mind a cascade of nested models that starts from a fundamental idea and incorporates through its evolution more and more biological detail. Excellent reviews of stochastic carcinogenesis modeling can be found in [5] and [6].
One part of the recent extensions tries to take population heterogeneity into account. Such heterogeneity can result from sources such as genetic variation, exposure to carcinogens due to either changes in environment or occupa-tion, and differences in lifestyle (the most prominent factors being smoking and diet). In [7] a mixture of a one stage model and a two stage model was used to describe the heritable and the sporadic form of Retinoblastoma, a cancer of the eye caused by mutations in a single tumor suppressor gene. Other approaches incorporate heterogeneity via standard frailty modeling, where the common baseline hazard h 0 (t) is multiplied by a non-negative random variable Z in order to model the individual hazard h ind (t) = Zh 0 (t), see for example [8], and [9] for such an approach.
In this text, we take up the work by [10]. These authors introduce two new population parameters to describe heterogeneity. The first one, called the fraction at risk F, is used to distinguish between susceptibles and a postulated group of immune individuals. The second one, called the fraction of deaths due to cancer among all deaths due to either cancer or related competing causes f, models competing related risks. They fit their model to US lung cancer incidence data from several birth cohorts. The parameters F and f present an abstract way describe population heterogeneity and are not linked to a specific biological process. Therefore, the above mentioned authors state that other modeling strategies could be tested. The present work gives such an attempt. We take up the same multistage model, but we will use mixture models to allow for variability among individuals. This allows us to introduce heterogeneity in a biologically meaningful way.
In the next section, we describe the multistage carcinogenesis model and introduce an extension by mixture. Then we will give a series of identifiability results for both the multistage model and some mixture models. Finally, we apply the model to human incidence data before giving some concluding remarks.

The Multistage Carcinogenesis Model
We will work with a simplified version of the multistage model, but one that is general enough to incorporate the two main features of the carcinogenesis process: the sequence of mutations and the clonal expansion. We make the following assumptions: 1. A cell must undergo n mutational events to get initiated.
2. The number of cells at risk, N 0 , is constant over time.
3. The number of newly generated initiated cells is a (nonhomogeneous) Poisson process with intensity λ I (t).
4. An initiated cell gives rise to a clonal expansion according to a birth-and-death process with emigration, i.e. in a short time interval (t, t + Δt) an initiated cell divides in two initiated cells with rate β, dies or differentiates with rate δ(<β), and divides into one initiated and one malignant cell with rate μ. 5. Once a promoted cell is generated, its growth is deterministic, and we neglect the time needed to grow to detectable tumor size. 6. The system starts with all at risk cells in the normal state and the different cells act independently of one another.
The model is shown schematically in Figure 1. Note that the above assumptions are standard in carcinogenesis modeling, and the possibility of generalization (for example to time dependent N 0 ) has been discussed by several authors. However, we choose this simplified version in order to limit the complexity of our baseline model. Note also that for n = 1 we get the classical TSCE model. The multistage carcinogenesis model Figure 1 The multistage carcinogenesis model. N 0 denotes the number of normal stem cells. To get initiated, a normal cell accumulates n consecutive mutations, where ν denotes the mutation rate per cell per year for the gene in question. The number of cells having k mutations is noted I k , 1 ≤ k ≤ n. The fully initiated cells, I n , expand according to a birth-and-death process. These cells give rise to tumor cells T if a further event happens, and μ/(μ + β) can be interpreted as the probability of such a malignant transformation during a cell division.
A detailed discussion and derivation of the survivor and hazard functions of this multistage model can be found in [11,12] and [10]. These authors show that the survivor function for tumor onset can be represented as In this expression is the intensity of initiation. The function F P (x) is the cdf for the waiting time for the first malignant transformation within a clone starting with one initiated cell at time 0. This cdf is improper since a clone of initiated cells dies out with a probability greater than 0 if δ > 0. Its exact form is Researchers have expressed concern about the approximations used in carcinogenesis models. This issue was raised in a review paper by [13] and has inspired [11] and [14]. Our formula above is exact and uses the method based on integration cited by [ [11], top of p. 1080]. The remaining simplifications in our model, in particular the constancy of N, are for convenience and do not affect the conclusions of the paper. As a general comment, it should be noted that the term two-stage refers to different things in different papers. It is, for example, possible to model clones via compartments or via branching processes. Both may employ the same parameter notation, but the interpretation will be quite different. Care has to be taken, if one wishes to stay close to biological reality. For further comments, see [ [13], section 2.1].
The hazard function can be easily calculated from the survivor function, h(t; n) = -d log S(t; n)/dt. In order to deduce the asymptotic behavior of the hazard for several n, we note that h(t; n) can be written in terms of a recursion, h(t; 1) = νN 0 F P (t) and h(t; n) = for n ≥ 2. Therefore, when n = 1 the hazard levels off as t goes to infinity. More precisely, the hazard of the TSCE model goes to the finite asymptote νN 0 ·P(a clone of initiated cells does not die out).
But the hazard grows to infinity if n ≥ 2. In both cases h(t; n) is strictly monotonic increasing with t. The unboundedness of the hazard is due to the simplifications in the model. This may lead to appreciable differences at low ages in some types of cancer or under hightened exposure. It is typically of lesser importance in human studies.
The monotonicity properties of hazard curves are not in agreement with observed incidence curves from human population data. Such data typically shows very low incidence up to about the age of fifty, a sudden and sharp increase between about the ages of fifty and eighty, and a subsequent leveling off and a decrease for the very old. This behavior at old ages is not captured by the hazard curves h(t; n). However, it can be modeled very easily by incorporating a frailty effect as we will show in the application later on.

Extension of the Model by the Use of Mixture Distributions
An observed human population is heterogeneous. Though the process of cancer development is similar for everyone, parameters may vary between individuals. Since all parameters of the model are biologically meaningful, we aim at modelling heterogeneity directly through these parameters. We thus propose to consider some of the biological parameters -one at a time -as random variables.
Let θ be such a parameter and let G(θ) be a distribution function for θ. Then, we will denote by S(t|θ) the survivor function of the multistage model (1) for a given value θ, whereas the population survivor function is Under certain regularity conditions an analogous representation holds for the hazard function The distribution function G must then be selected based on the biological parameter θ chosen. If we consider for example θ = n, the number of mutations needed for initiation, it is natural to choose a finite distribution, i.e. P(θ = n i ) = π i for a fixed set {n 1 ,...,n g } ⊂ such that .
This would correspond to g population subgroups having inherited different numbers of initiating mutations. The model could also be interpreted as a multiple pathway model, where g pathways involving different numbers of mutations can lead to cancer. Other interesting choices focus on promotion, for example θ = βδ, the growth advantage of initiated cells, or θ = μ, the promotion rate.
These two cases would be consistent with both a finite or a continuous distribution as long as its support is contained within biologically reasonable bounds.

Identifiability
Before fitting our mixture model to observed incidence data, the identifyability issue has to be considered. The parameters of the TSCE model cannot be uniquely determined based on incidence data. Some of the papers relevant to this issue are [15,16], and [17].

Identifiability of the Multistage Model
The parameters of our base model (Eq. 1, 2, 3) are n, the number of initiating mutations, and ψ = (N 0 , ν, β, δ, μ), sizes and rates. When fitting S(t|n, ψ), the survivor function of the multistage model given in (1), these six parameters cannot all be fitted separately. We will show that n and the follwing three combinations are, however, uniquely determined In order to deal with the discrepancy between the six parameters and the four identifyable features, we will hold the two parameters N 0 and δ fixed (see also [18]). To determine N 0 = Number of stem cells in a given tissue some preliminary biological estimate is needed. For the death rate we mainly focus on the choice δ = 0, which implies γ = βδ = β, so that β simply describes the growth advantage of initiated cells.

The Number of Mutations for Initiation
Is n identifiable in the multistage model or could a change in n be compensated by some adjustment of the biological parameters ψ so that finally the same survivor function resulted? The answer to this question is no, because the behavior of S(t|n, ψ) at the origin is enough to determine n.
Proof: Direct calculation shows that This result is mainly of theoretical interest and cannot be used to estimate n. In practice, for some tissues one can fix n according to the available biological theory. For example in colon cancer it is commonly assumed that two mutations are necessary for initiation, followed by a third one for malignant transformation (see [19]). In cases where no biological reasoning is available, we suggest to fit the model for several choices of n. The form of the intensity of initiation given in expression (2) shows that estimates for ν are highly sensitive to the choice of n.
Results within biologically reasonable limits will thus be obtained only for very few values n.

Growth and Mutation Rates
For n = 1 it has been shown in [16] that three functions of ψ are uniquely determined by S(t|n = 1, ψ). Their proof can be generalized to n ≥ 1. This is intuitively plausible. Given n, the intensity of initiation depends on the product N 0 ν n , but not on N 0 and ν individually. And the speed at which a clone of initiated cells grows depends only on the difference βδ, but not on the actual pair β, δ. First, we transform I(t; n, ψ) via (n -1) repeated integrations by parts into a n-fold integral. Next, we differentiate this expression n times with respect to t, to obtain Application of these two steps to both sides of (5) proves the result.

Identifiability of the Mixture Structure
Besides the parameters of the multistage model itself, we must also investigate the identifiability of the newly introduced mixture structure. Let be a family of distribution  Family is said to be identifiable with respect to , if holds for all G 1 , G 2 ∈ . In other words, the population survivor function must uniquely determine the underlying mixing distribution within a pre-specified family.
This condition turns out to be hard to verify in general settings and we must focus on special cases. A very useful result was given in [20] for finite mixtures. Let Θ be a set

Initiation
The multistage model we consider here describes initiation as a sequence of discrete events, namely rate limiting mutations, which lead to a cell capable of accelerated growth. A biological mechanism generating heterogeneity at this stage are germ line mutations of the genes involved, leading to individuals starting life with all cells in an intermediate stage. Mathematically, this means that the population survivor function is The next proposition shows that such a mixture is identifiable.
Proof: We show that condition (6) is satisfied. The initiation incidence rates can be written recursively Thus, for , Promotion is a complicated process and both genetic and epigenetic factors seem to be involved. Therefore, heterogeneity can be due to many different mechanisms. In the context of the multistage model, there are two main parameters these agents can influence: the growth advantage of initiated cells, γ, and the rate of malignant transformation, μ.
We can derive a result similar to the one in the previous section. Let there be a discrete set of γ-values 0 <γ 1 <γ 2 < ... . Note that we consider δ = γ i -β i as fixed, i.e. we assume in fact that there is an analogous sequence of β i . From now on, we will write ψ for the parameter vector (n, N 0 , ν, δ, μ).

Proposition 4. The family of finite mixture models induced by
Proof: We will first check condition (6) in the case δ > 0.
We have and the assumption δ > 0 implies that F P is improper and converges to a limit a(γ, δ) < 1 as t → ∞. The value 1 -a(γ, δ) is the probability that a clone of initiated cells (generated by a single initiated cell at time t = 0) eventually dies out without ever giving rise to a promoted cell. The assumption γ i+1 > γ i implies that a(γ i+1 ,δ) > a(γ i , δ), and as a consequence .

Fitting Mixture Models
We will now apply the proposed mixture model to the human lung cancer incidence data from [10]. These authors have studied mortalities due to lung cancer in different birth cohorts of European Americans, namely those born in the 1880s, 1890s, 1900s and 1920s. The data comes in form of a vector where r i counts the population at risk and o i counts the observed cancer cases during the time interval [t i , t i+1 ). The data is discussed in [21] and is publically available ( [22]). Additional information is given in [23].
In our case, the data is grouped into 5-year age groups: 0-4 years, 5-9 years, 10-14 years and so on. Figures 2 and  3 show the raw hazard estimates As mentioned earlier, the observed hazard has a peak at around 80 years and a decrease for the higher ages, while the hazard of the multistage model given by (1) is strictly monotonic increasing. Estimation of the parameters by analytic graduation as described later on leads to extremely poor fits, which for all ages above 30 and for all birth cohorts give quite useless predictions. The fault does, however, not lie with the methods of estimation but rather with the model. Thus, using the inverse of the vari- Observed incidence (males) Figure 2 Observed incidence (males). Observed lung cancer incidence rates in the United States for four birth cohorts. The population considered are the males of European descent. ance of the estimated incidence rates as weights leads to almost the same poor fit. The unmixed multistage model does not succeed in describing the incidence rates in Figures 2 and 3. We will come back to this failure.
We will use suitable transformations to respect these constraints.

Maximum Likelihood Estimation
We treat the failures from competing causes as right censorings. This means for each time interval [t i , t i+1 ) we observe o i failures due to cancer and have c i = r i -r i+1 -o i censored individuals. Under the assumption of independent and uninformative censoring, the likelihood function L(π l , ν, γ u , μ|N 0 , δ, n, γ l ) is given by By numerically optimizing this likelihood, we observe a strange behavior of the MLE. Figure 4 shows the data from the males 1880s cohort along with the models corresponding to the MLE, the least squares fit LSE, and the starting value of the numerical optimization. As we can see, the MLE fails completely to catch the behavior of the observed incidence at old ages; only the first few data points are well fitted. Convergence to this model seems even more astonishing when we consider the initial model. The chosen starting value is far away from the data in terms of fit, but it is close to the observed hazard in terms of shape. Furthermore, the model corresponding to the LSE fits the observed hazard very closely. This shows that the parametric family we apply to the data does indeed contain models that can fit. But in this example, likelihood and fit do not measure the same thing. The huge discrepancy, however, is intriguing. The strange behavior of the MLE is caused by several effects. One aspect is model mis-specification in relation with the special metric used in likelihood based inference. The data is not really generated by our multistage model, while the MLE corresponds to the survivor function that minimizes  the Kullback-Leibler distance to the observed empirical survivor function. But this is a very special metric and can produce obviously strange results in some cases.
In mechanistic modeling, likelihood based inference is often difficult due to local maxima and/or low curvature around the maxima. Both problems apply to our case. Our likelihood-surface is multimodal because the different biological parameters compete. This problem can be avoided by extensive use of the available biological knowledge. If we have good starting values and restrict attention to biologically reasonable intervals, then the likelihood surface is unimodal in that domain. The second problem is more difficult to treat. Even for identifiable parameters the likelihood surface is often extremely flat around its maximum. Figure 5 gives the contour plot of the log-likelihood for a reduced parameter space. That is, we take model (7), but fix all parameters except ψ = (logitπ l , log 10 μ). The log-likelihood essentially has a ridge starting the upper-right corner and running downwards as one moves to the left. This means that only a combination of the two parameters can be estimated precisely, but not both separately. The log-likelihood values of the estimates in Figure 5 are l( ) = -1.338·10 6 and l( ) = -1.355·10 6 . While these values appear to be close, they are in fact quite different in the likelihood metric, because A 95% confidence region determined by a likelihoodratio test is shown in Figure 6. Note how small this confidence set is. So, not only does the likelihood technology give badly fitting hasard rates, it is also overly optimistic about having found the right values.
The most important reason that leads to the failure of the MLE in our application, however, is the heavy censoring. We deal with human cancer incidence data. This means we consider a rare event, and most members of the population fail from competing causes. In the data set we are considering there are tens of millions individuals at risk at the first time points, but only some tens of thousands at the last one. In order to illustrate the impact of censoring, we will construct a sequence of artificial data sets that lead to the same raw hazard estimates, but differ in the degree of censoring. As before, we note by (r i , o i ) the real data set.
Let us define the points ( , ) bŷ Log-likelihood surface (zoomed) Figure 6 Log-likelihood surface (zoomed). Contour plot of the log-likelihood surface as shown in Figure 5. The plot zooms in on the MLE and in addition contains a 95% confidence ellipsoid.  That is we start with a population of size 10 6 and suppose that during every time interval exactly k10 4 individuals die -either due to cancer or due to competing causes. We then fit model (7) by maximum likelihood as before (we consider again the four parameters π l , ν, γ u , μ as unknown). Figure 7 gives the estimated models for k = 1,...,8. The MLE behaves better for small k than for large k.
In Figure 8 we calculated the residual sums of squares (RSS) for these models, which seem to increase exponentially with the coefficient of variation of the sequence, The above example shows that the MLE is dominated by the points corresponding to large "at risk" sets. The LSE, on the other hand, works fine, since it attributes equal weight to all age intervals. This makes one wonder whether a weighted least squares approach would suffer from the same problem as the MLE. If we give for example weights proportional to the population at risk, would the LSE break down as well? The answer to this question is clearly no. Considering Figure 4 once again, we realize that there is a model that fits all the data points very accurately. This model will be good even if we downweight the contribution to the RSS of the points at high ages. Any weighted least squares approach will select a model that is very close to the standard LSE.

Analytic Graduation
The LS estimates shown in the previous figures were obtained by analytic graduation, which is a standard procedure to fit continuous curves to discretized data. A detailed discussion of the procedure and derivation of asymptotic results can be found in [24,25].
Figures 9 and 10 show model (7) fitted to the different cohorts. The model successfully reproduces the observed data. Table 1 gives the corresponding parameter estimates.
Note that these values are conditional given N 0 , δ, γ l and achieved only as long as γ l is small enough. We set γ l = 10 -4 in the models given here.
The least squares estimates for a single unmixed multistage model in which all the parameters except N and δ are fitted, lead to curves ressembling the maximum likelihood fit in Figure 3. The estimated parameters show that the simple multistage model attempts to distinguish between the earlier and later cohorts to a large extent by increasing the initiation rate ν. The increase is three-fold between 1880 and 1920 for the females and even six-fold for the males. The other parameters remain more or less constant across the cohorts. The incidence rates in females is much lower than in males. While the mixture model adjusts to this through an adjustment of the mixture weights, the single multistage model explains it by very different estimates of the promotion rate μ between the sexes.
In order to assess the accuracy of the given estimates, we use projections of a joint confidence region rather than marginal confidence intervals. In other words, we determine a confidence region C ⊂ ‫ޒ‬ 4 , and then we look at projections of C on the six parameter plains spanned by the four parameter axis. The confidence region we get for the EAMs 1880s cohort is shown in Figure 11. The confidence region reveals the strong dependencies between the different parameters. Though a parametrization with such dependant parameters is unsatisfactory from a mathematical point of view, the dependencies might be interesting in biological terms. Not the two mutation rates ν and μ seem to compete, but rather the net growth rate γ and the two mutation rates. So at the extremes of the confidence region, we have models with high mutation rates but low proliferation of initiated cells, or models with low mutation rates but large cell growth. Note that the corresponding hazard curves are markedly different. π uνγ uμπ uνγ uμ LS fits (males) Figure 9 LS fits (males). Observed (dashed lines) and modeled (solid lines) incidence rates for the data from Figure 2. LS fits (females) Figure 10 LS fits (females). Observed (dashed lines) and modeled (solid lines) incidence rates for the data from Figure 3.

Discussion
The γ-frailty model we fitted to our data is such that only one parameter involved in promotion is allowed to vary between population subgroups. Such a model would suggest a process where initiated cells are created in all individuals according to the same dynamics, but only in a small subgroup of the population promotion and malignant transformation happen. This is consistent with the fact that promotion depends on stimuli that might be present only in a fraction of the population. The proportion of high risk individuals, estimated by , reflects the change of the hazard curves between the 1880s and the 1920s cohorts. The sharp increase of maximal incidence in a relatively short period of time must be due to environmental factors such as occupational exposure and smoking.
We got satisfactory fits only when we allowed for two clearly separated population subgroups with a low risk group that runs a risk close to zero. This is consistent with the results reported by the other research groups that introduced frailty into carcinogenesis modeling. In [10] the estimated fraction at risk is very low. And also in [8] the estimated proportion of susceptibles is lower than π u 95% asymptotic confidence region Figure 11 95% asymptotic confidence region. Projections of an asymptotic 95% confidence region (CR) for the parameters of the carcinogenesis model fit the the 1880s birth cohort of European American males. Note that we used the parametrization (logit(π l ), log(γ u -γ l ), log 10 ν, log 10 μ) in the fitting process. Transformed Parameters 0.5%, though these authors worked with Scandinavian data on testicular cancer.
Besides the γ-frailty model we considered here, one could clearly build mixture models using other parameters. In particular the number of mutations for initiation, n, the rate of malignant transformation, μ, or the initiating mutation rate ν would be natural choices. The increase in lung cancer rates that was observed during the 20 th century coincides with an increase in the fraction of smokers.
Smoking might very well influence μ and ν. However, when applying such models to our data, no new issues arise, and we omit a detailed discussion here. Still, one needs to realize that all the mentioned models can fit the data equally well. This is not surprising, since we fit a relatively complex model to a very simple data structure. However, in all approaches we tested, the data suggested two component mixtures with a small high risk group and a large quasi immune group.

Conclusion
We have studied an extension of the multistage carcinogenesis model by mixture. This allowed us to introduce population heterogeneity. The multistage model is a mechanistic model and all its parameters have a biological interpretation. Therefore, it is natural to introduce the notion of frailty in a biologically meaningful way. Such an approach is given by our mixture models, which can reproduce observed human lung cancer incidence data very accurately. Very good fits are achieved with very simple, two component mixtures also in cases where a continuous distribution might seem more adequate. However, the peak observed in the population hazard rates can be reproduced by continuous mixture models only when the population is clearly separated into a high risk and a low risk group. In other terms, the density of such a distribution would typically be bathtub shaped, closely resembling two component mixtures. Biological systems are often buffered. Small changes in the environment have no significant effect, and only after passing over some threshold value may abrupt changes in the system occur. Since we consider a late end-point, namely cancer, in a very complicated system, it is not surprising that we obtain in Section 4 mixture models that reflect such a buffering. It would be interesting to link the model with concrete biological mechanisms that are able to explain flip-flop processes. This would be an approach to understand how heterogeneity acts upon human carcinogenesis.