A parametric method for cumulative incidence modeling with a new four-parameter log-logistic distribution

Background Competing risks are frequently encountered in medical studies and must be analyzed with appropriate methods. A key quantity in competing risks is the cumulative incidence function, which most studies model using non- or semi-parametric methods. However, parametric models are required in some cases to ensure maximum efficiency and to fit various shapes of hazard function. Methods We used Hougaard's family of stable distributions to extend the two-parameter log-logistic distribution to a new four-parameter distribution, and carried out a simulation study to compare the cumulative incidence estimated with this distribution against the estimates obtained using a non-parametric method. To test our approach in a practical application, the model was applied to a set of real data on fertility history. Conclusions The results of the simulation studies showed that the estimated cumulative incidence function was more accurate than the non-parametric estimates in some settings. Analyses of real data indicated that the proposed distribution showed a much better fit to the data than the other distributions tested. Therefore, the new distribution is recommended for practical applications to parameterize the cumulative incidence function in competing risk settings.


Background
In medical research with time-to-event data, there may be more than one final outcome of interest, and this circumstance can complicate the statistical analysis. In such cases, events other than the desired one(s) are considered competing risks when their occurrence prevents the event of interest [1,2]. An important quantity in competing risk settings is the cumulative incidence function (CIF), which gives the probability that a particular event has occurred by a given time. In contrast, the cause-specific hazard function (CSHF) gives the instantaneous rate of the event. For example, in fertility studies in women, researchers are interested in calculating the cumulative live birth rate over time in the presence of competing risks. The probability of competing events, such as stillbirth or abortion, can be calculated in the same way.
Most competing risk analyses estimate the CIF non- or semi-parametrically [3,4]. However, parametric modeling is another available approach. The advantage of parametric methods over non- and semi-parametric ones is that, if the parametric model is selected correctly, it can predict the probability of the occurrence of events in the long term and provide additional insights about the time to failure and hazard functions [5]. Also, when the survival pattern follows a particular parametric model, the estimates from the correctly specified model are usually more accurate than the non-parametric estimates.
The best known distributions for modeling CIF are the Weibull and Gompertz distributions. However, these are suitable only for hazard functions that increase or decrease monotonically; they are inadequate when the hazard function shape is unimodal. In such cases, simple distributions such as the two-parameter log-logistic or log-normal distributions are likely to be better choices. One approach to the construction of flexible parametric models is to add a shape parameter to provide a wide range of hazard shapes and improve the models in survival data. In 1996, Mudholkar et al. proposed a generalized Weibull family with a range of hazard shapes [6] and Foucher et al. in 2005 applied this distribution in semi-Markov models [7]. In 2006, Sparling et al. presented a three-parameter family of survival distributions that included the Weibull, negative binomial, and log-logistic distributions as special cases [8]. These distributions can fit U-shapes or unimodal shapes for the hazard function, and therefore can be appropriate for survival data.
In light of the issues summarized above, a more efficient parametric distribution that accommodates various hazard shapes would appear to be useful for estimating CIF in competing risk situations. In recent years, various more flexible parametric distributions have been developed specifically for analyzing competing risk data. For example, in 2006 Jeong introduced a new parametric distribution for modeling CIF [5]. In 2009, Wahed et al. extended the Weibull distribution to a four-parameter beta-Weibull distribution for use in competing risks [9]. Here, we propose a new four-parameter log-logistic distribution that extends the two-parameter log-logistic distribution, accommodates different kinds of hazard shapes in survival data, and increases the efficiency of the CIF estimate over non-parametric approaches. In addition, it is an improper distribution, which gives it more flexibility for modeling CIF and makes it suitable for competing risk models. We performed a simulation study to compare CIF estimates obtained with the four-parameter distribution and a non-parametric method. After using simulated data to assess the method, we analyzed a real data set to examine the efficiency of our proposed distribution.

Introduction of the new distribution
The survival function according to a two-parameter log-logistic distribution is as follows: where λ > 0 and τ > 0 are the scale and shape parameters, respectively. If τ ≤ 1, the hazard function decreases monotonically, whereas if τ > 1, the hazard function is unimodal [10].
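The dependence of the hazard shape on the shape parameter can be checked numerically. The following Python sketch assumes one common parameterization of the two-parameter log-logistic distribution, S(t) = 1 / (1 + (lam*t)^tau) with scale `lam` and shape `tau`. This parameterization, the function names, and the evaluation grid are illustrative assumptions, not the authors' code. The values 0.3 and 2.97 are borrowed from the simulation settings reported later in the paper.

```python
def loglogistic_survival(t, lam, tau):
    """Assumed form S(t) = 1 / (1 + (lam*t)**tau)."""
    return 1.0 / (1.0 + (lam * t) ** tau)


def loglogistic_hazard(t, lam, tau):
    """Hazard h(t) = -d/dt log S(t) for t > 0 under the assumed form."""
    u = (lam * t) ** tau
    return tau * u / (t * (1.0 + u))


def hazard_shape(lam, tau, grid):
    """Crude classification: look at where the hazard peaks on a grid of positive times."""
    h = [loglogistic_hazard(t, lam, tau) for t in grid]
    peak = h.index(max(h))
    if peak == 0:
        return "decreasing"
    if peak == len(h) - 1:
        return "increasing"
    return "unimodal"


grid = [0.05 * i for i in range(1, 401)]  # times 0.05, 0.10, ..., 20.0
print(hazard_shape(0.3, 0.8, grid))   # shape parameter <= 1
print(hazard_shape(0.3, 2.97, grid))  # shape parameter > 1
```

Under this assumed parameterization the hazard for tau > 1 peaks at t = (tau - 1)^(1/tau) / lam, which is about 4.2 for the values used here, so the grid is wide enough to contain the mode.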

Survival function of the four-parameter log-logistic distribution
The two-parameter log-logistic distribution is extended on the basis of Hougaard's family of stable distributions, whose survival function is as follows: where H is the cumulative hazard function [11]. If the two-parameter log-logistic cumulative hazard function is used for H, we obtain a new distribution that is improper. In addition, to reduce the number of parameters, the substitution $\upsilon = \theta^{2-\alpha}$ is used [12]. The survival function of the new distribution is constructed as: where the parameter space is θ > 0, λ > 0, τ > 0, −∞ < α < ∞. The survival function must lie between zero and one, as shown in the Appendix. If α < 0, the survival function is improper, that is, it does not tend to zero as t tends to infinity, so the corresponding distribution function levels off below one. This is an important property for CIF modeling, and it distinguishes the new distribution from the two-parameter log-logistic distribution and other distributions.
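To illustrate what "improper" means in practice, the sketch below plugs a log-logistic cumulative hazard into a generic Hougaard-type (power variance function) survival form, exp(-(delta/alpha) * ((theta + H(t))^alpha - theta^alpha)). This generic form, the free parameter `delta`, the log-logistic parameterization H(t) = log(1 + (lam*t)^tau), and all numerical values are assumptions made for illustration. The paper's own reparameterized survival function is not reproduced here and may differ in detail.

```python
import math


def loglogistic_cumhaz(t, lam, tau):
    """Cumulative hazard H(t) = log(1 + (lam*t)**tau) (assumed parameterization)."""
    return math.log1p((lam * t) ** tau)


def pvf_survival(t, lam, tau, theta, alpha, delta=1.0):
    """Generic Hougaard-type survival with a log-logistic cumulative hazard.

    `delta` is a hypothetical free parameter kept only for this sketch;
    alpha must be nonzero.
    """
    H = loglogistic_cumhaz(t, lam, tau)
    return math.exp(-(delta / alpha) * ((theta + H) ** alpha - theta ** alpha))


def survival_limit(theta, alpha, delta=1.0):
    """Limit of the survival function as t tends to infinity."""
    if alpha < 0:
        # (theta + H)**alpha tends to 0, leaving a plateau strictly between 0 and 1
        return math.exp(delta * theta ** alpha / alpha)
    return 0.0


for alpha in (-1.0, 0.5):
    s_late = pvf_survival(1e6, 0.3, 2.97, theta=1.0, alpha=alpha)
    print(alpha, s_late, survival_limit(1.0, alpha))
```

With a negative `alpha` the survival curve flattens out at a positive value, so the implied event probability never reaches one. This is the behaviour needed for a cumulative incidence function, whose upper limit is the overall probability of that event type.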

Hazard function
The hazard function can be directly obtained from equation (3), as: Because this hazard function is complex, there is no simple closed-form condition on the parameters that identifies each type of hazard shape. The flexibility of the hazard function is shown in Figure 1. Compared to the two-parameter model, the four-parameter log-logistic distribution has a flexible hazard function that can be monotonically decreasing or increasing, unimodal, or U-shaped.

Cumulative incidence function
Competing risks data are represented as a pair (T, δ), where T is the time to the first event or censoring and δ is the indicator variable, defined as δ = 0 if the observation is censored and δ = k (k = 1, 2, ..., K) if an event of type k is observed, K being the number of competing events. The two major quantities in the analysis of competing risks data are the CSHF and the CIF. The CSHF for event k is the instantaneous rate of event k at time t among subjects who have not experienced any type of event before t. The CIF for event k, F_k(t) = P(T ≤ t, δ = k), is the cumulative probability of observing event k by time t. The CIF for event k is defined as follows: $F_k(t) = \int_0^t h_k(u)\,S(u)\,du$, where S(u) = P(T > u) and h_k(u) is the hazard function for the kth cause-specific event. In the literature, parametric methods are proposed to estimate CIF with the CSHF method [5,9,13]. Here we have also used the CSHF method to model CIF.
To estimate the CIF non-parametrically, the overall survival function should be replaced with the Kaplan-Meier estimate and the cause-specific cumulative hazard function with the Nelson-Aalen estimate [3].
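The non-parametric recipe just described can be sketched in a few lines of Python. At each distinct event time, the CIF increases by the Kaplan-Meier overall survival just before that time multiplied by the Nelson-Aalen increment for the cause of interest. The function name and the toy data are hypothetical; the paper itself used the "cuminc" function in R.

```python
from itertools import groupby


def nonparametric_cif(times, events, cause):
    """Return [(t, CIF_cause(t))] at each distinct time with at least one event.

    events: 0 = censored, 1..K = event type.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0  # Kaplan-Meier overall survival just before the current time
    cif = 0.0
    steps = []
    for t, group in groupby(data, key=lambda rec: rec[0]):
        codes = [code for _, code in group]
        d_all = sum(1 for code in codes if code != 0)
        d_cause = sum(1 for code in codes if code == cause)
        if d_all > 0:
            cif += surv * d_cause / n_at_risk  # S(t-) times the Nelson-Aalen increment
            surv *= 1.0 - d_all / n_at_risk    # Kaplan-Meier update over all causes
            steps.append((t, cif))
        n_at_risk -= len(codes)  # events and censored subjects leave the risk set
    return steps


# Toy data (hypothetical): four subjects, the last one censored.
toy_times = [1.0, 2.0, 3.0, 4.0]
toy_events = [1, 2, 1, 0]
print(nonparametric_cif(toy_times, toy_events, cause=1))
```

Without censoring, the estimates for all causes sum to one at the last event time, which is a convenient sanity check.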

Estimation method
For convenience, we have assumed throughout this paper that there are two events, the desired event k = 1 and a competing event k = 2, and that n is the sample size. Because the two events are mutually exclusive, the overall survival function factors into a product of two cause-specific survival functions, i.e. S(t, ψ) = S_1(t, ψ_1) S_2(t, ψ_2). Therefore, the likelihood function for parametric inference is constructed as: where ψ_k = (λ_k, τ_k, θ_k, α_k) is the parameter vector for event k, S_k(t, ψ_k) is the survival function for event k, and f_k(t, ψ_k) is the density function of event k based on a four-parameter log-logistic distribution.
According to the invariance property of the maximum likelihood estimate (MLE), the CIF is estimated by substituting the MLE $\hat{\psi}$ into expression (5), which yields $\hat{F}_k(t) = \int_0^t h_k(u, \hat{\psi}_k)\,S(u, \hat{\psi})\,du$.
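The estimation steps can be sketched as follows. Each subject contributes the product of both cause-specific survival functions, and a subject with an observed event additionally contributes the hazard of the observed cause, which is equivalent to writing the contribution as a density for one cause times the survival function for the other. Because the four-parameter survival function is not reproduced in this text, the sketch uses a two-parameter log-logistic form, S(t) = 1 / (1 + (lam*t)^tau), as a stand-in. That parameterization, the function names, and the midpoint integration rule are all assumptions for illustration only (the paper used PROC NLMIXED in SAS).

```python
import math


def log_surv(t, lam, tau):
    """log S(t) for the stand-in log-logistic model."""
    return -math.log1p((lam * t) ** tau)


def log_hazard(t, lam, tau):
    """log h(t) for the stand-in log-logistic model, t > 0."""
    u = (lam * t) ** tau
    return math.log(tau) + math.log(u) - math.log(t) - math.log1p(u)


def log_likelihood(times, events, psi1, psi2):
    """events: 0 = censored, 1 = event of interest, 2 = competing event."""
    ll = 0.0
    for t, code in zip(times, events):
        ll += log_surv(t, *psi1) + log_surv(t, *psi2)  # every subject: S1 * S2
        if code == 1:
            ll += log_hazard(t, *psi1)  # h1*S1*S2 = f1*S2
        elif code == 2:
            ll += log_hazard(t, *psi2)  # h2*S1*S2 = f2*S1
    return ll


def plug_in_cif(t, psi_k, psi1, psi2, n_steps=20000):
    """Midpoint-rule approximation to the integral of h_k(u) S1(u) S2(u) over (0, t)."""
    width = t / n_steps
    total = 0.0
    for i in range(n_steps):
        u = (i + 0.5) * width
        total += math.exp(log_hazard(u, *psi_k) + log_surv(u, *psi1) + log_surv(u, *psi2))
    return total * width


# Parameter values taken from the simulation settings in the text; in practice
# they would be replaced by the maximizer of log_likelihood.
psi1, psi2 = (0.3, 2.97), (0.03, 1.1)
print(plug_in_cif(3.0, psi1, psi1, psi2), plug_in_cif(3.0, psi2, psi1, psi2))
```

A useful check on any implementation of this kind is that the two plug-in CIFs at time t add up to one minus the overall survival at t.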

Simulation study
A simulation study was used to compare the cumulative incidence estimate of the proposed distribution with a three-parameter distribution proposed by Sparling [8] and the non-parametric method at different times. As described by Beyersmann in 2009, we first simulated survival times T with all-cause hazard h_1(t) + h_2(t) on the basis of a two-parameter log-logistic distribution, with λ_1 = 0.3, τ_1 = 2.97 for the event of interest and λ_2 = 0.03, τ_2 = 1.1 for the competing event (based on fertility data). The event type was then determined by a binomial experiment with probability h_1(t)/(h_1(t) + h_2(t)) of event type 1 [15,16]. Additionally, we generated censoring times with a binomial experiment. Each simulated data set had size n = 1000 and a 7% censoring level. We then fitted the four-parameter log-logistic distribution, the Sparling distribution, and the non-parametric method to these data. In total, 1000 samples were generated, and the bias and empirical mean square error (MSE) of the CIF at time t were calculated as follows: $\mathrm{Bias}(t) = \frac{1}{1000}\sum_{i=1}^{1000}\hat{F}_1^{(i)}(t) - F_1(t)$ and $\mathrm{MSE}(t) = \frac{1}{1000}\sum_{i=1}^{1000}\{\hat{F}_1^{(i)}(t) - F_1(t)\}^2$, where $\hat{F}_1^{(i)}(t)$ is the estimate from the ith simulated sample and F_1(t) is the true value of CIF at time t [17].
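The data-generating scheme and the bias and MSE summaries can be sketched in Python as follows. The sketch draws an all-cause time whose hazard is the sum of the two cause-specific hazards (the minimum of two independent draws has exactly this hazard), then assigns the event type by a Bernoulli draw. Several simplifications are assumptions of this sketch and not the authors' setup: the log-logistic parameterization S(t) = 1 / (1 + (lam*t)^tau), the omission of censoring, the smaller sample size and number of replicates, and the use of the empirical proportion as the CIF estimator (which coincides with the non-parametric estimate when there is no censoring).

```python
import math
import random


def hazard(t, lam, tau):
    u = (lam * t) ** tau
    return tau * u / (t * (1.0 + u))


def survival(t, lam, tau):
    return 1.0 / (1.0 + (lam * t) ** tau)


def draw_time(rng, lam, tau):
    """Inverse-transform draw: solve S(t) = u for a uniform u in (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return ((1.0 - u) / u) ** (1.0 / tau) / lam


def simulate_one(rng, psi1, psi2):
    """All-cause time with hazard h1 + h2, then a Bernoulli draw for the event type."""
    t = min(draw_time(rng, *psi1), draw_time(rng, *psi2))  # survival S1*S2
    h1, h2 = hazard(t, *psi1), hazard(t, *psi2)
    code = 1 if rng.random() < h1 / (h1 + h2) else 2
    return t, code


def true_cif1(t, psi1, psi2, n_steps=5000):
    """Midpoint-rule value of the integral of h1(u) S1(u) S2(u) over (0, t)."""
    width = t / n_steps
    mids = ((i + 0.5) * width for i in range(n_steps))
    return width * sum(hazard(u, *psi1) * survival(u, *psi1) * survival(u, *psi2) for u in mids)


def bias_and_mse(estimates, truth):
    n = len(estimates)
    bias = sum(estimates) / n - truth
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return bias, mse


rng = random.Random(2024)
psi1, psi2 = (0.3, 2.97), (0.03, 1.1)  # values quoted in the text
t0, n, n_rep = 3.0, 200, 200           # reduced sizes, chosen for speed
estimates = []
for _ in range(n_rep):
    sample = [simulate_one(rng, psi1, psi2) for _ in range(n)]
    estimates.append(sum(1 for t, code in sample if t <= t0 and code == 1) / n)
bias, mse = bias_and_mse(estimates, true_cif1(t0, psi1, psi2))
print(bias, mse)
```

To compare estimators as in the paper, the empirical proportion would be replaced by each fitted model's CIF at the chosen time, and a censoring mechanism would be added to the generated data.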
To test the efficiency of the parametric distribution proposed here, we carried out a second simulation study. Failure times were generated on the basis of a two-parameter Weibull distribution with k_1 = 1.4, p_1 = 0.45 for the event of interest and k_2 = 1.04, p_2 = 0.03 for the competing event. We used the same method to fit the new distribution to these data.
The maximum likelihood estimates of the parameter vectors were calculated by PROC NLMIXED in SAS v. 9.1, and the non-parametric estimate of CIF was obtained with the "cuminc" R function from the "cmprsk" library. Because the choice of suitable initial values is an important problem when fitting such models numerically, many initial values were examined to achieve convergence. Table 1 summarizes the results of the first simulation, in which the four-parameter log-logistic distribution, the Sparling distribution and the non-parametric method were fitted at different times with n = 1000. The results showed that the bias and MSE of the CIF estimates obtained with the four-parameter method for the event of interest at t = 1.25 to t = 2 were smaller than with the Sparling distribution and the non-parametric method. For the competing event, the bias and MSE of the CIF estimates obtained with the four-parameter method were lower than with the non-parametric method.

Results
The results of the second simulation are summarized in Table 2. Up to t = 1.5, the bias and the MSE of the CIF estimates obtained with the non-parametric method for the event of interest were lower than with the four-parameter method, but after t = 2, the bias and MSE of the CIF estimates with the new distribution were equivalent to or slightly lower than with the non-parametric method. For the competing event, the bias and MSE of the CIF estimates with the new distribution were lower than with the non-parametric method at all times.
In summary, these two simulations indicate that the four-parameter modeling of CIF was as efficient as the non-parametric method and the Sparling distribution, and sometimes led to better estimates of CIF. Moreover, the four-parameter log-logistic model performed well whether the true model was a two-parameter log-logistic distribution or a Weibull distribution.

Example: women's fertility history
We tested the proposed distribution on a set of real data. In a cross-sectional study, the fertility history of 858 women aged 15-49 years in rural areas of the Shiraz district (southwestern Iran) was reviewed (unpublished data). The women were selected by multistage random sampling from a list of villages in 2008. Only the first pregnancy of each woman was included in this study. A self-administered questionnaire regarding fertility history was used. After women with an undesired first pregnancy were excluded, the final sample consisted of 652 women. Live birth as a result of the first delivery was our desired event, and a stillborn fetus or abortion was the competing event. The event time was defined as the interval between marriage and a live birth, a competing event or censoring. Women who had not given birth by the date of interview (7% in this data set) were censored. The estimated cumulative incidence of live births and of abortions or stillborn fetuses based on the two- and four-parameter log-logistic, Weibull, Gompertz and Sparling distributions and the non-parametric estimates are shown in Figure 2. Up to time t = 3, the cumulative incidence of live births increased rapidly; thereafter, cumulative incidence tended to plateau. This means that the probability of live births increased rapidly during the first three years after marriage, and remained approximately constant thereafter. The curves also show that the four-parameter log-logistic distribution was closer to the non-parametric estimate than the other distributions at all times. For shorter intervals since marriage, the two-parameter log-logistic and Sparling distributions were closer to the non-parametric estimates than the Weibull and Gompertz distributions were. After t = 5, all distributions were close to the observed data.
Table 2 The results of parametric and non-parametric estimates of CIF based on a four-parameter log-logistic simulation for different times.
Table 3 shows the Akaike information criterion (AIC), Bayesian information criterion (BIC) and estimated cumulative incidence for the two events at different times. Based on the AIC and BIC criteria, the four-parameter log-logistic model, which had the lowest AIC and BIC, showed a better fit to the data than the two-parameter log-logistic, Sparling, Weibull or Gompertz distributions. Because the two-parameter log-logistic distribution is nested within the Sparling and the four-parameter log-logistic distributions, we can compute likelihood-ratio chi-square statistics to test the fit of the nested models. The likelihood-ratio chi-square statistics and their corresponding p-values are:
Table 3 The Akaike information criterion (AIC), Bayesian information criterion (BIC) and the estimates of the cumulative incidence function under competing risks based on different distributions with the non-parametric method.
These results confirm the findings in Figure 2, and again indicate that the proposed distribution shows a closer fit to the observed data than the other distributions to which it is compared.
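For readers who wish to reproduce this kind of model comparison, the criteria can be computed from maximized log-likelihoods as in the following Python sketch. The log-likelihood values and parameter counts in the usage lines are hypothetical placeholders, not results from this study; only the sample size of 652 comes from the text. The chi-square tail probability is computed with a standard incomplete-gamma recurrence so that the sketch needs only the standard library.

```python
import math


def aic(log_lik, n_params):
    return 2.0 * n_params - 2.0 * log_lik


def bic(log_lik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2.0 * log_lik


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1 and x >= 0."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # df = 2 starting value
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # df = 1 starting value
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)  # step df up by 2
        a += 1.0
    return q


def likelihood_ratio_test(log_lik_small, log_lik_big, extra_params):
    """Statistic and p-value for a smaller model nested in a larger one."""
    stat = 2.0 * (log_lik_big - log_lik_small)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical maximized log-likelihoods, for illustration only.
ll_two_param, ll_four_param = -1010.0, -1000.0
print(aic(ll_two_param, 4), aic(ll_four_param, 8))
print(bic(ll_two_param, 4, 652), bic(ll_four_param, 8, 652))
print(likelihood_ratio_test(ll_two_param, ll_four_param, extra_params=4))
```

Lower AIC and BIC indicate the preferred model, and the degrees of freedom of the likelihood-ratio test equal the number of additional parameters in the larger model, here assumed to be two extra parameters for each of the two event types.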

Discussion
Although non-parametric methods such as the Kaplan-Meier approach are widely used in survival analysis and may show a very close fit to the data, they do not provide additional information about the nature of the data. Therefore, in this study our ultimate aim was to develop a new parametric distribution by extension of the two-parameter log-logistic distribution. The addition of third and fourth parameters allows the model to capture U-shaped hazards.
Our simulation study showed that, in several settings, the parametric estimate of CIF with the new distribution was slightly less biased and had a smaller MSE than the estimate obtained using non-parametric methods. Simulations with the two-parameter log-logistic and Weibull distributions showed that our proposed four-parameter distribution had appropriate efficiency. Also, analyses of real data indicated that the proposed distribution showed a much better fit to the data than the other distributions tested. Our results are consistent with other studies in finding that an appropriate parametric model yields more precise estimates of cumulative incidence than non-parametric methods, and is thus a potentially suitable way to describe quantities of competing risks [9,18]. In contrast, if a parametric model is mis-specified, the quantities will be estimated incorrectly, which will clearly bias the inferences [12]. However, our proposed distribution captures various hazard shapes well, which extends its applicability to a variety of survival data.
In addition to this advantage, the proposed distribution is improper for α < 0. This property makes our proposed distribution superior to other distributions such as the Weibull, two-parameter log-logistic, three-parameter Sparling and generalized Weibull models [6,8]. This characteristic of our distribution also makes it possible to evaluate the direct effect of covariates on CIF, which is not possible in the CSHF model [19,20]. The potential applications of direct modeling of CIF and parametric regression models with the four-parameter log-logistic distribution will be examined in forthcoming papers.

Conclusions
Despite the complexity of this distribution for modeling CIF (which is one of its limitations), the results of our simulation study and real-data application show that the new distribution achieves a much better fit to the data than other distributions that use fewer parameters. Whereas the two-parameter log-logistic is a proper distribution, the four-parameter log-logistic is an improper distribution in a subset of the parameter space. Therefore, this distribution is suitable for parameterizing CIF directly in competing risk models. Moreover, it can be added to the existing family of distributions and is also potentially useful for parameterizing survival data in general.