Theoretical Biology and Medical Modelling

Background: It is of particular interest to identify cancer-specific molecular signatures for early diagnosis, monitoring effects of treatment and predicting patient survival time. Molecular information about patients is usually generated from high throughput technologies such as microarray and mass spectrometry. Statistically, we are challenged by the large number of candidates but only a small number of patients in the study, and the right-censored clinical data further complicate the analysis. Results: We present a two-stage procedure to profile molecular signatures for survival outcomes. Firstly, we group closely-related molecular features into linkage clusters, each portraying either similar or opposite functions and playing similar roles in prognosis; secondly, a Bayesian approach is developed to rank the centroids of these linkage clusters and provide a list of the main molecular features closely related to the outcome of interest. A simulation study showed the superior performance of our approach. When it was applied to data on diffuse large B-cell lymphoma (DLBCL), we were able to identify some new candidate signatures for disease prognosis. Conclusion: This multivariate approach provides researchers with a more reliable list of molecular features profiled in terms of their prognostic relationship to the event times, and generates dependable information for subsequent identification of prognostic molecular signatures through either biological procedures or further data analysis.


Background
High-throughput biotechnologies such as microarray and mass spectrometry permit simultaneous measurements of enormous bodies of genomic, proteomic, and metabolic information to be made. Such information helps us understand the molecular basis of important clinical outcomes, and thus improves the efficiency as well as accuracy in clinical decision making. More specifically, a small subset of these molecules can be used as biomarkers in daily clinical practice for detecting disease at early stages, measuring disease progress, monitoring the efficacy of treatments, and potentially accelerating the drug discov-ery process. However, the promise of genomics, proteomics, and metabolomics in clinical medicine rests on identifying these disease-specific molecular signatures. Clinical and preclinical studies of patients' genomics and proteomics profiles usually present datasets that share common characteristics, i.e., many molecular features ("large p") collected from few individuals ("small n"). The statistical challenge is to mine prognostic signatures from thousands of candidates by efficiently extracting information from samples of limited size, i.e., "small n large p" datasets. Moreover, the clinical outcomes measured for certain patients, e.g., survival times of cancer patients, are usually censored data, which further complicates the statistical analysis. There has been extensive research on the classification and prediction of cancer using gene expression information [1][2][3], but there has been less progress in identifying individual molecules that can be used to predict the clinical outcome. We devote this paper to developing a Bayesian approach to profile molecular features on the basis of their prognostic relations to event times. The proportional hazard model [4] has a long history in modeling the association of risk factors to the right-censored event times observed in clinical study [5,6]. Through this model, it has been of special interest to develop a systematic approach to identifying molecular signatures for event times with "small n large p" datasets. However, the overwhelmingly larger number of molecular candidates compared to the number of individuals prohibits exhaustive variable selection because of the heavy computation and model-overfitting considerations. A variety of strategies have been proposed in the literature. The first is to reduce the list of genotypic candidates by univariately associating each of them with phenotypic clinical outcome [1,7], and then regress the clinical outcome on the selected candidates. The second employs principal component analysis (PCA) to build up "eigengenes" (i.e., linear combinations of genes) and associates these with phenotypic clinical outcomes, and the identification of molecular signatures is further explored on the basis of these [8]. The third strategy employs partial least squares (PLS) [9,10] to construct orthogonal "eigengenes" [11]. Other strategies have also been used to reveal interesting prognostic molecular signatures for certain event times [12][13][14][15]. Recently, Tadesse et al. [16] proposed a Bayesian error-in-variable survival model to identify genes of which the expression levels are associated with survival outcome. It is widely accepted that most genes measured in microarray experiments provide little information for predicting patient survival, so a necessary step in the analysis is to reduce the number of candidates before identifying prognostic molecular signatures with a relatively small sample. This reduction is usually carried out by ranking molecular features (either the original molecular candidates or the "eigengenes") according to either z scores [7] or Cox scores [17][18][19], which measure the univariate association of each molecular feature with the event time. Several top-ranked molecular features are further explored for their prognostic associations with the event time. As shown in our simulation study, employing the univariate Cox scores to profile molecular features can be misleading as it may miss many important candidates but select many false-prognostic ones. Indeed, molecular features with high univariate association to the event time may not necessarily predict the event time effectively when applied together. As shown by Sha et al. [20] and Tadesse et al. [21], the disease may often be affected jointly by subsets of the genes while each individual gene might have a relatively weak effect. This study focuses on developing an efficient yet robust approach to profiling molecular features on the basis of their prognostic associations with the event time, taking advantage of the Bayesian framework for the proportional hazard model proposed by Kalbfleisch [22].
We acknowledge the high correlation between some molecular features due to the complicated genetic architecture. For example, genes involved in the same metabolic pathway may be similarly or oppositely regulated. These closely-related molecular features can result in collinearity between the candidates, and should therefore be grouped together in order to address their prognostic associations with the event time properly. Here, we group closely-related molecular features into linkage clusters. A centroid "gene" is constructed to represent each linkage cluster and thus partially solve the collinearity issue. As univariate Cox scores are unable to account for the complicated correlation structures among molecular features, we employ the Bayesian approach to construct a natural framework for molecular feature profiling.
We first propose a two-stage procedure for profiling prognostic molecular signatures for event times, and present the construction of linkage clusters as well as their centroids. A Bayesian framework of the Cox proportional hazard model is specified for "large p small n" data and a profiling criterion is described accordingly. The performance of our approach is evaluated via a simulation study and application to data concerning diffuse large B-cell lymphoma (DLBCL) [15].

Simulation study
To evaluate the performance of the proposed approach, we simulated 20 survival datasets, each having p = 1, 000 features and n = 125 independent individuals. The feature values were generated from an autoregressive process of order one, with autocorrelation ρ = 0.5 and unit variance white noise. The event times follow an exponential distribution of which the rate is determined by a linear combination of the 12 features with non-zero coefficients. Independent random censoring times were generated from standard exponential distributions, and this induced censoring of approximately 50% of the observed event times. Among the 1, 000 autocorrelated features, the indices of the 12 with non-zero coefficients are 150, 151, 300, 302, 450, 453, 600, 604, 750, 755, 900, 906, and their values alternate between 1 and -1. Such constant-magnitude coefficients were chosen in order to evaluate the effect of correlation among features on the profiling, as the correlations between the pairs of features, i.e., (150, 151), (300, 302), (450, 453), (600, 604), (750, 755), and (900, 906), decrease geometrically from 0.5 to 0.015625. As shown in Figure 1, these feature pairs have similar chances of being selected as top features while being ranked by the Bayesian approach. In this simulation study, the proposed Bayesian approach could select each non-zero coefficient feature with high probability (more than 0.8) when more than 12 features were selected in total. However, when univariate Cox scores are used, a feature pair with higher correlation is more likely to be among the selected top features, and in general, all 12 features are less likely to be correctly selected, as shown in Figure 2. The percentages of the 12 non-zero coefficient features selected into top features (i.e., success rates) are shown in Figure 3 when using the Bayesian approach, or the univariate Cox scores. The univariate Cox scores can lead to very high false discovery rates because the features with non-zero coefficients are usually ranked very low. Furthermore, as shown in Figure  3, the success rates of selecting features with non-zero coefficients are very low even when a large number of features are selected. On the other hand, when more than 12 features are selected using the Bayesian approach, the success rates are usually higher than 0.8 and approach 1 very quickly as more features are selected.

Application to a real dataset
We applied the proposed two-stage procedure to data on diffuse large B-cell lymphoma (DLBCL) [15]. These data include the expression levels of 7, 399 genes from a total of 240 patients. The genomic information for each patient was obtained at the beginning of the study, and the patients were followed up until death or the end of the project. The missing gene expression values were imputed using the nearest neighbor averaging approach [12,23]. Using the single linkage clustering approach in Cluster 3.0 [24], we identified 5,656 linkage clusters by pruning the hierarchical tree such that the node distances within branches are less than 0.2. There are 4,944 linkage clusters containing only one gene, while the largest has 186 genes. We then consider selecting prognostic molecular features Frequency of successes using the Bayesian approach   Table 1.
Employing our Bayesian approach to profile the 500 candidates with the largest Cox scores, the posterior probabilities, i.e., k defined in (2), of the top 25 clusters range from 0.0538 to 0.9825. However, the ranks of these 25 clusters vary widely when their univariate Cox scores are used, and only five of those with the top 25 univariate Cox scores appear in this list. Therefore, it may be misleading to profile the clusters for their prognostic ability on the basis of their univariate Cox scores, since many false prognostic features can be highly ranked owing to the complicated correlation structure among features.
When fewer than 500 candidates, for example, 100, 200 or 300, are profiled with the Bayesian approach, most of those that appeared in the top 25 of the 500 profiled candidates are also among the top 25 clusters as long as they are profiled. Indeed, the only exception is the cluster with two features in gene NM_00176, which was ranked at 61 when 300 candidates were profiled by the Bayesian approach. However, the complicated correlation structure between clusters makes it preferable to profile a number of clusters sufficient to avoid missing critical prognostic features. An exploratory selection of prognostic features from the top 25 clusters shown in Table 1 implies that 16 Frequency of successes using univariate Cox scores   [15]. D13666, which was reported by both Sha et al. [20] and Gui and Li [25], belongs to the lymph-node signature group, and BC012161 and AF134159 belong to the proliferation signature group (see Rosenwald et al. [15]). D42043, D88532, BC012161, and LC_33732 were also reported by Sha et al. [20]. It is interesting to observe that, among the 16 selected genes, AF414120 (gene CTLA4) is a member of the immunoglobulin superfamily and encodes a protein that transmits an inhibitory signal to T cells (Ling et al. [26]). AF127481, a lymphoid blast crisis oncogene (LBC), plays an important role in regulating the Rho/Rac GTPase cycle while the Rho/Rac family of small GTPases mediates cytoskeletal reorganization, gene transcription, and cell cycle progression through unique signal transduction pathways (Sterpetti et al. [27]). U46767 (gene CCL13) encodes a cytokine that plays a role in the accumulation of leukocytes during inflammation (Garcia-Zepeda et al. [28]). NM_000176 (gene NR3C1) encodes a receptor for glucocorticoids that can act as both a transcription factor and a regulator of other transcription factors. This protein can also be found in heteromeric cytoplasmic complexes along with heat shock factors and immunophilins (Subramaniam et al. [29]). X52186 (gene ITGB4) encodes the integrin beta 4 subunit, a receptor for the laminins, which tends to asso-

Number of selected features
Secuccess rate ciate with the alpha 6 subunit and is likely to play a pivotal role in the biology of invasive carcinoma (Hogervorst et al. [30]).

Discussion
With high-throughput techniques now available, there has been extensive recent discussion of disease-specific molecular signatures [31,32]. The whole genome and proteome profiles for each of the limited number of patients presents an enormous number of molecular candidates with a complicated correlation structure. Here we group highly-correlated molecular features into linkage clusters in order to profile prognostic signatures. While the molecular features within the same linkage clusters are expected to have similar prognostic association with the event time, physical linkages and metabolic pathways will be able to provide confirmatory information. Thereupon, we strongly suggest that available genome, proteome, and metaboleme information be explored and combined with observed profiles to construct the linkage clusters. By doing this, we can improve the reliability significantly and establish the biological functionality of linkage clusters without overusing the limited number of profiles.
When an optimal subset with a prespecified number of candidates is targeted, classical model selection approaches may be employed to explore all possible subsets and identify the best one. However, "small n large p" datasets may still obstruct this practice because of the enormous computation and unidentifiable models involved. Indeed, when p diverges as n → ∞, many classi-Cox Score plot for DLBCL data cal approaches may not work even if p <n. Although univariate Cox scores are frequently utilized to profile candidates and accordingly select the subset, it is risky to identify prognostic signatures by this approach as it is easy to include false signatures but miss the true ones owing to the strong correlations among molecular features. For example, when a molecular feature is positively correlated with both true signatures, it may happen that the false one, instead of the two true ones, is selected. Ein-Dor et al. [33] discussed the discrepancies while using a univariate approach. Built upon the multivariate proportional hazard model (1), the proposed Bayesian approach is able to search all possible subsets of a certain size stochastically via Gibbs sampling. With restrictive priors for "small n large p" datasets, the posterior probability k serves as a relative measure for profiling each candidate's prognostic association with the event time, accounting for other candidates. It is straightforward to extend this Bayesian approach to profile molecular signatures by controlling other clinical factors [3] and considering microenvironments [34].
The three-component prior for the coefficients in model (1) is crucial in constructing the profiling criterion. First, the prior probability of each component can be controlled with a uniform distribution on a subset of [0,1] to guarantee that the model is identifiable such that a Gibbs sampler can feasibly be employed to search the parameter p

Construction of linkage clusters and their centroids
To facilitate pattern recognition and reveal otherwise hidden structures and functions, genes and proteins are usually clustered into groups based on different biological metrics, such as sequence similarity [35] or expression profiles [12,36]. With gene expression data only, many approaches have been applied to cluster genes that exhibit similar expression profiles across samples (see the review by Jörnsten and Yu, [36]). As we are interested in the prognostic relationships of genes to the event time, highly correlated genes are more likely to have the same power to predict the event time and therefore should be grouped into the same cluster. Here we define linkage clusters as groups of genes with large pairwise correlations in absolute value. As shown by Shaffer et al. [37], these linkage clusters of molecular features may reveal functional similarity/dissimilarity. Proteins regulated in the same metabolic pathway may also act similarly or oppositely. Although expressions of genes/proteins are usually observed with measurement errors, these linkage clusters can still be identified in experimental data. As shown in Figure 5, the molecular features can be mutually correlated as highly as correlation coefficient |ρ| ≥ 0.96. From a statistical point of view, these closely-related molecular features can cause collinearity or near collinearity in multivariate identification of prognostic signatures and therefore destabilize the identification result if all these closelyrelated molecular features are included in the prognostic model.
Hierarchical clustering approaches (e.g., the complete linkage clustering or single linkage clustering approach in Cluster 3.0 by de Hoon et al. [24]) can be used to construct these linkage clusters. With the mutual correlation coefficients estimated from the data, we use the absolute values of correlation coefficients as the similarity scores, i.e., with the distance measure d = 1 -|ρ|. We prune the hierarchical tree with a prespecified value for the distance, e.g., d ≤ 0.2 (hence absolute values of the correlation coef-p Illustration of Linkage Blocks  ficients are no less than 0.8). The molecular features within the same branch are assumed to be within the same linkage cluster. The centroid of the linkage cluster is used to represent all the elements within the linkage cluster, and subsequent identification of prognostic signatures proceeds by associating these centroids only with the event time.
The expression levels of the centroids are calculated by standardizing expressions of all genes. More specifically, for each linkage cluster, we first randomly select one gene and reverse the expression signs of all genes within the cluster that are negatively correlated with this gene. Then the expression level of the centroid is calculated by averaging the expression levels of all the genes within the same linkage cluster. Meanwhile, the measurement errors in expression levels are attenuated after averaging the gene expression levels within the same linkage cluster.

Bayesian framework of proportional hazard model
Suppose that p linkage clusters are identified and therefore the expressions of the p centroids are calculated for each of the n individuals. The observed event time for the i-th individual is denoted y i , with δ i indicating whether it was right-censored owing to loss in follow-up (i.e., δ i = 0 if right-censored and δ i = 1 otherwise). Accordingly, the expressions of the centroids are denoted z i = (z i1 , z i2 , …, z ip ). We use the popular Proportional Hazard Model [4] to associate the molecular features with the event time, i.e., the hazard function is modeled as follows: λ(t|z i ) = λ 0 (t)exp(z i β), (1) where β includes the p coefficients for all the centroids, and λ 0 (·) is an unspecified baseline hazard function. With a large number of available linkage clusters, the time-to-event of interest may be associated with a relatively small number of linkage clusters. On the other hand, the available "large p small n" data sets hamper us in detecting linkage clusters with too weak effects on the time-to-event of interest, and we expect to be able to iden-tify those linkage clusters with strong effects. We therefore incorporate this important prior information by considering the following prior distribution for each β k : where N + (µ, σ 2 ) and N -(µ, σ 2 ) are the truncated Gaussian distributions with only positive and negative parts, respectively. As shown by Zhang et al. [38] and Zhang et al. [39], this three-component prior has some theoretical properties and allows a possible imbalance between scales and/ or sizes of positive and negative coefficients in model (1).
A priori, the hyperparameters and are assumed to follow inverse gamma distributions as IG(1, φ + ) and IG(1, φ -), respectively. Here, sufficiently large φ + and φare recommended to approximate the noninformative priors 1/ and 1/ , respectively. For the prior distribution of (w + ,w -), a noninformative prior (such as Dirichlet(1, 1, 1)) is not applicable since the model is not identifiable with p » n. As shown in Zhang et al. [40], the number of reliably identified significant predictors is limited by the sample size n. Following Zhang et al. [39] and Zhang et al. [41], we let which guarantees the model to be identifiable. On the other hand, as the number of candidate predictors is large, the upper bound, 2 /p, on (w + + w -) can be so restrictive that the resultant posterior probability for a true predictor to be significant can be very small. However, these posterior probabilities, as relative measures of significance, play an important role in profiling all features for their prognostic relations to the event time.

The Gibbs sampler
In view of the large number of parameters to be estimated, we consider a Gibbs sampler to obtain the posterior distributions of the parameters and make inferences. The Gibbs sampler can be developed by iteratively sampling each parameter from its full conditional distribution.
For simplicity, let β -k include all components of β except β k , and write g k (β k |β -k , ) = PL(β| ) when β k is of par- Here, w k+ and w k-are normalization coefficients, which can be calculated as The full conditional distribution of w + and wis (w + , w -, 1 -w + -w -)|β Dirichlet( + , -, p -+ --), w + + w -≤ , where + = #{k : β k > 0} and -= #{k : β k < 0}. Finally, the full conditional distribution of and are The parameters were initialized on the basis of estimators from univariate approaches. After the initial burn-in period (5,000 in the following analysis), the next 5,000 iterations in the Markov chain were used for inference without thinning. Convergence of the algorithm was checked by the diagnostic tools in Cowles and Carlin [42].

Profiling criterion
The significance of each centroid in model (1) is determined by one pair of parameters. They are, for the j-th centroid, the posterior probabilities p k+ = P(β k > 0| ) and /p on (w + + w -) may not be restrictive and β k can be estimated with the median value of its posterior probability. However, p is usually much larger than n in gene expression data and, as a result, p k+ and p k-may be heavily shrunk to zero. Therefore, identifying significant prognostic centroids with the posterior median values will be too conservative. Instead, we suggest profiling the prognostic association of k-th centroid to the event time by k = max{p k+ , p k-}, (2) which is a relative measure when p is much larger than n.
As demonstrated in the simulation study, this profiling criterion performs much better than the popular univariate Cox score.