- Research
- Open Access
- Published:

# Bayesian profiling of molecular signatures to predict event times

*Theoretical Biology and Medical Modelling*
**volume 4**, Article number: 3 (2007)

## Abstract

### Background

It is of particular interest to identify cancer-specific molecular signatures for early diagnosis, monitoring effects of treatment and predicting patient survival time. Molecular information about patients is usually generated from high throughput technologies such as microarray and mass spectrometry. Statistically, we are challenged by the large number of candidates but only a small number of patients in the study, and the right-censored clinical data further complicate the analysis.

### Results

We present a two-stage procedure to profile molecular signatures for survival outcomes. Firstly, we group closely-related molecular features into linkage clusters, each portraying either similar or opposite functions and playing similar roles in prognosis; secondly, a Bayesian approach is developed to rank the centroids of these linkage clusters and provide a list of the main molecular features closely related to the outcome of interest. A simulation study showed the superior performance of our approach. When it was applied to data on diffuse large B-cell lymphoma (DLBCL), we were able to identify some new candidate signatures for disease prognosis.

### Conclusion

This multivariate approach provides researchers with a more reliable list of molecular features profiled in terms of their prognostic relationship to the event times, and generates dependable information for subsequent identification of prognostic molecular signatures through either biological procedures or further data analysis.

## Background

High-throughput biotechnologies such as microarray and mass spectrometry permit simultaneous measurements of enormous bodies of genomic, proteomic, and metabolic information to be made. Such information helps us understand the molecular basis of important clinical outcomes, and thus improves the efficiency as well as accuracy in clinical decision making. More specifically, a small subset of these molecules can be used as biomarkers in daily clinical practice for detecting disease at early stages, measuring disease progress, monitoring the efficacy of treatments, and potentially accelerating the drug discovery process. However, the promise of genomics, proteomics, and metabolomics in clinical medicine rests on identifying these disease-specific molecular signatures. Clinical and preclinical studies of patients' genomics and proteomics profiles usually present datasets that share common characteristics, i.e., many molecular features ("large *p*") collected from few individuals ("small *n*"). The statistical challenge is to mine prognostic signatures from thousands of candidates by efficiently extracting information from samples of limited size, i.e., "small *n* large *p*" datasets. Moreover, the clinical outcomes measured for certain patients, e.g., survival times of cancer patients, are usually censored data, which further complicates the statistical analysis. There has been extensive research on the classification and prediction of cancer using gene expression information [1–3], but there has been less progress in identifying individual molecules that can be used to predict the clinical outcome. We devote this paper to developing a Bayesian approach to profile molecular features on the basis of their prognostic relations to event times.

The proportional hazard model [4] has a long history in modeling the association of risk factors to the right-censored event times observed in clinical study [5, 6]. Through this model, it has been of special interest to develop a systematic approach to identifying molecular signatures for event times with "small n large p" datasets. However, the overwhelmingly larger number of molecular candidates compared to the number of individuals prohibits exhaustive variable selection because of the heavy computation and model-overfitting considerations. A variety of strategies have been proposed in the literature. The first is to reduce the list of genotypic candidates by univariately associating each of them with phenotypic clinical outcome [1, 7], and then regress the clinical outcome on the selected candidates. The second employs principal component analysis (PCA) to build up "eigengenes" (i.e., linear combinations of genes) and associates these with phenotypic clinical outcomes, and the identification of molecular signatures is further explored on the basis of these [8]. The third strategy employs partial least squares (PLS) [9, 10] to construct orthogonal "eigengenes" [11]. Other strategies have also been used to reveal interesting prognostic molecular signatures for certain event times [12–15]. Recently, Tadesse et al. [16] proposed a Bayesian error-in-variable survival model to identify genes of which the expression levels are associated with survival outcome. It is widely accepted that most genes measured in microarray experiments provide little information for predicting patient survival, so a necessary step in the analysis is to reduce the number of candidates before identifying prognostic molecular signatures with a relatively small sample. This reduction is usually carried out by ranking molecular features (either the original molecular candidates or the "eigengenes") according to either z scores [7] or Cox scores [17–19], which measure the univariate association of each molecular feature with the event time. Several top-ranked molecular features are further explored for their prognostic associations with the event time. As shown in our simulation study, employing the univariate Cox scores to profile molecular features can be misleading as it may miss many important candidates but select many false-prognostic ones. Indeed, molecular features with high univariate association to the event time may not necessarily predict the event time effectively when applied together. As shown by Sha et al. [20] and Tadesse et al. [21], the disease may often be affected jointly by subsets of the genes while each individual gene might have a relatively weak effect. This study focuses on developing an efficient yet robust approach to profiling molecular features on the basis of their prognostic associations with the event time, taking advantage of the Bayesian framework for the proportional hazard model proposed by Kalbfleisch [22].

We acknowledge the high correlation between some molecular features due to the complicated genetic architecture. For example, genes involved in the same metabolic pathway may be similarly or oppositely regulated. These closely-related molecular features can result in collinearity between the candidates, and should therefore be grouped together in order to address their prognostic associations with the event time properly. Here, we group closely-related molecular features into linkage clusters. A centroid "gene" is constructed to represent each linkage cluster and thus partially solve the collinearity issue. As univariate Cox scores are unable to account for the complicated correlation structures among molecular features, we employ the Bayesian approach to construct a natural framework for molecular feature profiling.

We first propose a two-stage procedure for profiling prognostic molecular signatures for event times, and present the construction of linkage clusters as well as their centroids. A Bayesian framework of the Cox proportional hazard model is specified for "large *p* small *n*" data and a profiling criterion is described accordingly. The performance of our approach is evaluated via a simulation study and application to data concerning diffuse large B-cell lymphoma (DLBCL) [15].

## Results

### Simulation study

To evaluate the performance of the proposed approach, we simulated 20 survival datasets, each having *p* = 1, 000 features and *n* = 125 independent individuals. The feature values were generated from an autoregressive process of order one, with autocorrelation *ρ* = 0.5 and unit variance white noise. The event times follow an exponential distribution of which the rate is determined by a linear combination of the 12 features with non-zero coefficients. Independent random censoring times were generated from standard exponential distributions, and this induced censoring of approximately 50% of the observed event times. Among the 1, 000 autocorrelated features, the indices of the 12 with non-zero coefficients are 150, 151, 300, 302, 450, 453, 600, 604, 750, 755, 900, 906, and their values alternate between 1 and -1. Such constant-magnitude coefficients were chosen in order to evaluate the effect of correlation among features on the profiling, as the correlations between the pairs of features, i.e., (150, 151), (300, 302), (450, 453), (600, 604), (750, 755), and (900, 906), decrease geometrically from 0.5 to 0.015625. As shown in Figure 1, these feature pairs have similar chances of being selected as top features while being ranked by the Bayesian approach. In this simulation study, the proposed Bayesian approach could select each non-zero coefficient feature with high probability (more than 0.8) when more than 12 features were selected in total. However, when univariate Cox scores are used, a feature pair with higher correlation is more likely to be among the selected top features, and in general, all 12 features are less likely to be correctly selected, as shown in Figure 2. The percentages of the 12 non-zero coefficient features selected into top features (i.e., success rates) are shown in Figure 3 when using the Bayesian approach, or the univariate Cox scores. The univariate Cox scores can lead to very high false discovery rates because the features with non-zero coefficients are usually ranked very low. Furthermore, as shown in Figure 3, the success rates of selecting features with non-zero coefficients are very low even when a large number of features are selected. On the other hand, when more than 12 features are selected using the Bayesian approach, the success rates are usually higher than 0.8 and approach 1 very quickly as more features are selected.

### Application to a real dataset

We applied the proposed two-stage procedure to data on diffuse large B-cell lymphoma (DLBCL) [15]. These data include the expression levels of 7, 399 genes from a total of 240 patients. The genomic information for each patient was obtained at the beginning of the study, and the patients were followed up until death or the end of the project. The missing gene expression values were imputed using the nearest neighbor averaging approach [12, 23]. Using the single linkage clustering approach in Cluster 3.0 [24], we identified 5,656 linkage clusters by pruning the hierarchical tree such that the node distances within branches are less than 0.2. There are 4,944 linkage clusters containing only one gene, while the largest has 186 genes. We then consider selecting prognostic molecular features from the 5, 656 candidates, each being the centroid of a linkage cluster.

The univariate Cox scores of all candidate clusters are calculated and shown in decreasing order in Figure 4. There are 761 candidates with Cox scores above the 95 percentile of the {\chi}_{1}^{2} distribution, and 290 candidates with Cox scores above the 99 percentile of the {\chi}_{1}^{2} distribution. We selected the top 100, 200, 300, and 500 candidates with the largest Cox scores and applied our Bayesian method to profile them. The top 25 of the 500 candidates are listed in Table 1.

Employing our Bayesian approach to profile the 500 candidates with the largest Cox scores, the posterior probabilities, i.e., \tilde{p}_{
k
}defined in (2), of the top 25 clusters range from 0.0538 to 0.9825. However, the ranks of these 25 clusters vary widely when their univariate Cox scores are used, and only five of those with the top 25 univariate Cox scores appear in this list. Therefore, it may be misleading to profile the clusters for their prognostic ability on the basis of their univariate Cox scores, since many false prognostic features can be highly ranked owing to the complicated correlation structure among features.

When fewer than 500 candidates, for example, 100, 200 or 300, are profiled with the Bayesian approach, most of those that appeared in the top 25 of the 500 profiled candidates are also among the top 25 clusters as long as they are profiled. Indeed, the only exception is the cluster with two features in gene NM_00176, which was ranked at 61 when 300 candidates were profiled by the Bayesian approach. However, the complicated correlation structure between clusters makes it preferable to profile a number of clusters sufficient to avoid missing critical prognostic features. An exploratory selection of prognostic features from the top 25 clusters shown in Table 1 implies that 16 genes may be considered to construct prognostic features for the event time, and some of these features were ignored from the lists of 100, 200, and 300 candidates chosen on the basis of their univariate Cox scores. The cluster with 38 features from 11 genes is not one of the 16 selected, though all those genes except AK000170 belong to the MHC class II signature group defined by Rosenwald et al. [15]. D13666, which was reported by both Sha et al. [20] and Gui and Li [25], belongs to the lymph-node signature group, and BC012161 and AF134159 belong to the proliferation signature group (see Rosenwald et al. [15]). D42043, D88532, BC012161, and LC_33732 were also reported by Sha et al. [20]. It is interesting to observe that, among the 16 selected genes, AF414120 (gene CTLA4) is a member of the immunoglobulin superfamily and encodes a protein that transmits an inhibitory signal to T cells (Ling et al. [26]). AF127481, a lymphoid blast crisis oncogene (LBC), plays an important role in regulating the Rho/Rac GTPase cycle while the Rho/Rac family of small GTPases mediates cytoskeletal reorganization, gene transcription, and cell cycle progression through unique signal transduction pathways (Sterpetti et al. [27]). U46767 (gene CCL13) encodes a cytokine that plays a role in the accumulation of leukocytes during inflammation (Garcia-Zepeda et al. [28]). NM_000176 (gene NR3C1) encodes a receptor for glucocorticoids that can act as both a transcription factor and a regulator of other transcription factors. This protein can also be found in heteromeric cytoplasmic complexes along with heat shock factors and immunophilins (Subramaniam et al. [29]). X52186 (gene ITGB4) encodes the integrin beta 4 subunit, a receptor for the laminins, which tends to associate with the alpha 6 subunit and is likely to play a pivotal role in the biology of invasive carcinoma (Hogervorst et al. [30]).

## Discussion

With high-throughput techniques now available, there has been extensive recent discussion of disease-specific molecular signatures [31, 32]. The whole genome and proteome profiles for each of the limited number of patients presents an enormous number of molecular candidates with a complicated correlation structure. Here we group highly-correlated molecular features into linkage clusters in order to profile prognostic signatures. While the molecular features within the same linkage clusters are expected to have similar prognostic association with the event time, physical linkages and metabolic pathways will be able to provide confirmatory information. Thereupon, we strongly suggest that available genome, proteome, and metaboleme information be explored and combined with observed profiles to construct the linkage clusters. By doing this, we can improve the reliability significantly and establish the biological functionality of linkage clusters without overusing the limited number of profiles.

When an optimal subset with a prespecified number of candidates is targeted, classical model selection approaches may be employed to explore all possible subsets and identify the best one. However, "small *n* large *p*" datasets may still obstruct this practice because of the enormous computation and unidentifiable models involved. Indeed, when *p* diverges as *n* → ∞, many classical approaches may not work even if *p* <*n*. Although univariate Cox scores are frequently utilized to profile candidates and accordingly select the subset, it is risky to identify prognostic signatures by this approach as it is easy to include false signatures but miss the true ones owing to the strong correlations among molecular features. For example, when a molecular feature is positively correlated with both true signatures, it may happen that the false one, instead of the two true ones, is selected. Ein-Dor et al. [33] discussed the discrepancies while using a univariate approach. Built upon the multivariate proportional hazard model (1), the proposed Bayesian approach is able to search all possible subsets of a certain size stochastically via Gibbs sampling. With restrictive priors for "small *n* large *p*" datasets, the posterior probability \tilde{p}_{
k
}serves as a relative measure for profiling each candidate's prognostic association with the event time, accounting for other candidates. It is straightforward to extend this Bayesian approach to profile molecular signatures by controlling other clinical factors [3] and considering microenvironments [34].

The three-component prior for the coefficients in model (1) is crucial in constructing the profiling criterion. First, the prior probability of each component can be controlled with a uniform distribution on a subset of [0,1] to guarantee that the model is identifiable such that a Gibbs sampler can feasibly be employed to search the parameter space stochastically. Putting these prior probabilities on a restricted interval allows various numbers of nonzero coefficients in model (1). Second, the three-component prior approach provides flexibility in the possible imbalance between the scales and/or sizes of positive and negative coefficients in the model. Third, the three-component prior automatically results in a three-component posterior distribution for each coefficient, with the posterior probability of each component available for further calculation. In summary, the profiling criterion, posterior probability (\tilde{p}_{
k
}), has a natural explanation and can be easily implemented in practice.

## Methods

### Construction of linkage clusters and their centroids

To facilitate pattern recognition and reveal otherwise hidden structures and functions, genes and proteins are usually clustered into groups based on different biological metrics, such as sequence similarity [35] or expression profiles [12, 36]. With gene expression data only, many approaches have been applied to cluster genes that exhibit similar expression profiles across samples (see the review by Jörnsten and Yu, [36]). As we are interested in the prognostic relationships of genes to the event time, highly correlated genes are more likely to have the same power to predict the event time and therefore should be grouped into the same cluster. Here we define linkage clusters as groups of genes with large pairwise correlations in absolute value. As shown by Shaffer et al. [37], these linkage clusters of molecular features may reveal functional similarity/dissimilarity. Proteins regulated in the same metabolic pathway may also act similarly or oppositely. Although expressions of genes/proteins are usually observed with measurement errors, these linkage clusters can still be identified in experimental data. As shown in Figure 5, the molecular features can be mutually correlated as highly as correlation coefficient |*ρ*| ≥ 0.96. From a statistical point of view, these closely-related molecular features can cause collinearity or near collinearity in multivariate identification of prognostic signatures and therefore destabilize the identification result if all these closely-related molecular features are included in the prognostic model.

Hierarchical clustering approaches (e.g., the complete linkage clustering or single linkage clustering approach in Cluster 3.0 by de Hoon et al. [24]) can be used to construct these linkage clusters. With the mutual correlation coefficients estimated from the data, we use the absolute values of correlation coefficients as the similarity scores, i.e., with the distance measure *d* = 1 - |*ρ*|. We prune the hierarchical tree with a prespecified value for the distance, e.g., *d* ≤ 0.2 (hence absolute values of the correlation coefficients are no less than 0.8). The molecular features within the same branch are assumed to be within the same linkage cluster. The centroid of the linkage cluster is used to represent all the elements within the linkage cluster, and subsequent identification of prognostic signatures proceeds by associating these centroids only with the event time.

The expression levels of the centroids are calculated by standardizing expressions of all genes. More specifically, for each linkage cluster, we first randomly select one gene and reverse the expression signs of all genes within the cluster that are negatively correlated with this gene. Then the expression level of the centroid is calculated by averaging the expression levels of all the genes within the same linkage cluster. Meanwhile, the measurement errors in expression levels are attenuated after averaging the gene expression levels within the same linkage cluster.

### Bayesian framework of proportional hazard model

Suppose that *p* linkage clusters are identified and therefore the expressions of the *p* centroids are calculated for each of the *n* individuals. The observed event time for the *i*-th individual is denoted *y*_{
i
}, with *δ*_{
i
}indicating whether it was right-censored owing to loss in follow-up (i.e., *δ*_{
i
}= 0 if right-censored and *δ*_{
i
}= 1 otherwise). Accordingly, the expressions of the centroids are denoted *z*_{
i
}= (*z*_{i 1}, *z*_{i 2}, …, *z*_{
ip
}). We use the popular Proportional Hazard Model [4] to associate the molecular features with the event time, i.e., the hazard function is modeled as follows:

*λ*(*t*|*z*_{
i
}) = *λ*_{0}(*t*)exp(*z*_{
i
}** β**), (1)

where ** β** includes the

*p*coefficients for all the centroids, and

*λ*

_{0}(·) is an unspecified baseline hazard function.

Further, let \mathcal{D} = {(*y*_{
i
}, *δ*_{
i
}, *z*_{
i
}) : *i* = 1, 2, …, *n*} be the observed data, and \mathcal{R}(*t*) = {*i* : *y*_{
i
}≥ *t*} be the risk set at time *t*. Following Kalbfleisch [22], we construct the Bayesian framework to estimate ** β** by considering only the partial likelihood function,

PL(\beta |\mathcal{D})={{\displaystyle \prod _{i=1}^{n}\left\{\frac{\mathrm{exp}\left\{{z}_{i}\beta \right\}}{{\displaystyle {\sum}_{j\in \mathcal{R}({y}_{i})}\mathrm{exp}\left\{{z}_{j}\beta \right\}}}\right\}}}^{{\delta}_{i}},

which avoids the nuisance baseline hazard function *λ*_{0}(*t*).

With a large number of available linkage clusters, the time-to-event of interest may be associated with a relatively small number of linkage clusters. On the other hand, the available "large *p* small *n*" data sets hamper us in detecting linkage clusters with too weak effects on the time-to-event of interest, and we expect to be able to identify those linkage clusters with strong effects. We therefore incorporate this important prior information by considering the following prior distribution for each *β*_{
k
}:

*β*_{
k
} ~ (1 - *w*_{+} - *w*_{-})*δ*_{{0}} + *w*_{+}*N*_{+}(0, {\sigma}_{+}^{2}) + *w*_{-}*N*_{-}(0, {\sigma}_{-}^{2}),

where *N*_{+}(*μ*, *σ*^{2}) and *N*_{-}(*μ*, *σ*^{2}) are the truncated Gaussian distributions with only positive and negative parts, respectively. As shown by Zhang et al. [38] and Zhang et al. [39], this three-component prior has some theoretical properties and allows a possible imbalance between scales and/or sizes of positive and negative coefficients in model (1).

*A priori*, the hyperparameters {\sigma}_{+}^{2} and {\sigma}_{-}^{2} are assumed to follow inverse gamma distributions as *IG*(1, *φ*_{+}) and *IG*(1, *φ*_{-}), respectively. Here, sufficiently large *φ*_{+} and *φ*_{-} are recommended to approximate the noninformative priors 1/{\sigma}_{+}^{2} and 1/{\sigma}_{-}^{2}, respectively. For the prior distribution of (*w*_{+},*w*_{-}), a noninformative prior (such as *Dirichlet*(1, 1, 1)) is not applicable since the model is not identifiable with *p* » *n*. As shown in Zhang et al. [40], the number of reliably identified significant predictors is limited by the sample size *n*. Following Zhang et al. [39] and Zhang et al. [41], we let

*w*_{+} + *w*_{-} ~ *Unif*(0, 2\sqrt{n}/*p*),

which guarantees the model to be identifiable. On the other hand, as the number of candidate predictors is large, the upper bound, 2\sqrt{n}/*p*, on (*w*_{+} + *w*_{-}) can be so restrictive that the resultant posterior probability for a true predictor to be significant can be very small. However, these posterior probabilities, as relative measures of significance, play an important role in profiling all features for their prognostic relations to the event time.

### The Gibbs sampler

In view of the large number of parameters to be estimated, we consider a Gibbs sampler to obtain the posterior distributions of the parameters and make inferences. The Gibbs sampler can be developed by iteratively sampling each parameter from its full conditional distribution.

For simplicity, let *β*_{-k}include all components of ** β** except

*β*

_{ k }, and write

*g*

_{ k }(

*β*

_{ k }|

*β*

_{-k}, \mathcal{D}) =

*PL*(

*β*|\mathcal{D}) when

*β*

_{ k }is of particular interest. Then the full conditional distribution of

*β*

_{ k }is a mixture of a point mass at zero and two continuous distributions, i.e.,

*β*_{
k
}|*w*_{+}, *w*_{-},{\sigma}_{+}^{2},{\sigma}_{-}^{2},\mathcal{D}

~ (1 - \tilde{w}_{k+}- \tilde{w}_{k-})*δ*_{{0}} + \tilde{w}_{k+}\tilde{F}_{k+}+ \tilde{w}_{k-}\tilde{F}_{k-},

where \tilde{F}_{k+}and \tilde{F}_{k-}zare the distributions corresponding to the following probability density functions,

{\tilde{f}}_{k+}(x)=\frac{{g}_{k}(x|{\beta}_{-k},\mathcal{D})}{{w}_{k+}\sqrt{2\pi {\sigma}_{+}^{2}}}\mathrm{exp}\left\{-\frac{{x}^{2}}{2{\sigma}_{+}^{2}}\right\}I[x>0],

{\tilde{f}}_{k-}(x)=\frac{{g}_{k}(x|{\beta}_{-k},\mathcal{D})}{{w}_{k-}\sqrt{2\pi {\sigma}_{-}^{2}}}\mathrm{exp}\left\{-\frac{{x}^{2}}{2{\sigma}_{-}^{2}}\right\}I[x<0],

and the probabilities for *β*_{
k
}to be positive and negative are, respectively,

{\tilde{w}}_{k+}=\frac{2{w}_{+}{w}_{k+}}{(1-{w}_{+}-{w}_{-}){g}_{k}(0|{\beta}_{-k},\mathcal{D})+2{w}_{+}{w}_{k+}+2{w}_{-}{w}_{k-}},

{\tilde{w}}_{k-}=\frac{2{w}_{-}{w}_{k-}}{(1-{w}_{+}-{w}_{-}){g}_{k}(0|{\beta}_{-k},\mathcal{D})+2{w}_{+}{w}_{k+}+2{w}_{-}{w}_{k-}}.

Here, *w*_{k+}and *w*_{k-}are normalization coefficients, which can be calculated as

{w}_{k+}={\displaystyle {\int}_{0}^{\infty}\frac{{g}_{k}(x|{\beta}_{-k},\mathcal{D})}{\sqrt{2\pi {\sigma}_{+}^{2}}}}\mathrm{exp}\left\{-\frac{{x}^{2}}{2{\sigma}_{+}^{2}}\right\}dx,

{w}_{k-}={\displaystyle {\int}_{-\infty}^{0}\frac{{g}_{k}(x|{\beta}_{-k},\mathcal{D})}{\sqrt{2\pi {\sigma}_{-}^{2}}}}\mathrm{exp}\left\{-\frac{{x}^{2}}{2{\sigma}_{-}^{2}}\right\}dx.

The full conditional distribution of *w*_{+} and *w*_{-} is

(*w*_{+}, *w*_{-}, 1 - *w*_{+} - *w*_{-})|*β*

~ *Dirichlet*(\tilde{p}_{+}, \tilde{p}_{-}, *p* - \tilde{p}_{+} - \tilde{p}_{-}), *w*_{+} + *w*_{-} ≤ \frac{2\sqrt{n}}{p},

where \tilde{p}_{+} = #{*k* : *β*_{
k
} > 0} and \tilde{p}_{-} = #{*k* : *β*_{
k
} < 0}. Finally, the full conditional distribution of {\sigma}_{+}^{2} and {\sigma}_{-}^{2} are

{\sigma}_{+}^{-2}|\beta ~Gamma\left(\frac{{\tilde{p}}_{+}}{2},{\left(\frac{1}{{\phi}_{+}}+\frac{1}{2}{\displaystyle \sum _{k=1}^{p}{\beta}_{k}^{2}}I[{\beta}_{k}>0]\right)}^{-1}\right),

{\sigma}_{-}^{-2}|\beta ~Gamma\left(\frac{{\tilde{p}}_{-}}{2},{\left(\frac{1}{{\phi}_{-}}+\frac{1}{2}{\displaystyle \sum _{k=1}^{p}{\beta}_{k}^{2}}I[{\beta}_{k}<0]\right)}^{-1}\right).

The parameters were initialized on the basis of estimators from univariate approaches. After the initial burn-in period (5,000 in the following analysis), the next 5,000 iterations in the Markov chain were used for inference without thinning. Convergence of the algorithm was checked by the diagnostic tools in Cowles and Carlin [42].

### Profiling criterion

The significance of each centroid in model (1) is determined by one pair of parameters. They are, for the *j*-th centroid, the posterior probabilities *p*_{k+}= *P*(*β*_{
k
}> 0|\mathcal{D}) and *p*_{k-}= *P*(*β*_{
k
}< 0|\mathcal{D}). Given data \mathcal{D}, the marginal posterior distribution of *β*_{
k
}is still a mixture of three components, i.e., being positive with probability *p*_{k+}= *E*[\tilde{w}_{k+}|\mathcal{D}], being negative with probability *p*_{k-}= *E*[\tilde{w}_{k-}|\mathcal{D}], and having a point mass at zero with probability 1 - *p*_{k+}- *p*_{k-}. The two parameters *p*_{k+}and *p*_{k-}can be estimated from the Markov chains of \tilde{w}_{βk+}and \tilde{w}_{βk-}drawn from the above Gibbs sampler. With moderately large *p*, the upper bound 2\sqrt{n}/*p* on (*w*_{+} + *w*_{-}) may not be restrictive and *β*_{
k
}can be estimated with the median value of its posterior probability. However, *p* is usually much larger than *n* in gene expression data and, as a result, *p*_{k+}and *p*_{k-}may be heavily shrunk to zero. Therefore, identifying significant prognostic centroids with the posterior median values will be too conservative. Instead, we suggest profiling the prognostic association of *k*-th centroid to the event time by

_{
k
}= max{*p*_{k+}, *p*_{k-}}, (2)

which is a relative measure when *p* is much larger than *n*. As demonstrated in the simulation study, this profiling criterion performs much better than the popular univariate Cox score.

## References

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression profiling. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.

Li H, Gui J: Partial Cox regression analysis for high-dimensional microarray gene expression data. ISMB04/Bioinformatics. 2004, 20: i208-i215. 10.1093/bioinformatics/bth900.

Li L: Survival prediction of diffuse large-B-cell lymphoma based on both clinical and gene expression information. Bioinformatics. 2006, 22: 466-471. 10.1093/bioinformatics/bti824.

Cox DR: Regression models and life tables. J R Stat Soc Ser B. 1972, 39: 264-296.

Al-katib A: Treatment of diffuse poorly differentiated lymphocytic lymphoma: an analysis of prognostic variables. Cancer. 1984, 53: 2404-2412. 10.1002/1097-0142(19840601)53:11<2404::AID-CNCR2820531107>3.0.CO;2-F.

Papatestas AE, Miller SR, Pertsemlidis D, Fagerstrom R, Lesnick G, Aufses AH: Association between prognosis and hormone receptors in women breast cancer. Cancer Detection and Prevention. 1986, 9: 303-310.

Lossos IS, Czerwinski DK, Alizadeh AA, Wechser MA, Tibshirani R, Botstein D, Levy R: Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Eng J Med. 2004, 350: 1828-1837. 10.1056/NEJMoa032520.

West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JAJ, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA. 2001, 98: 11462-11467. 10.1073/pnas.201162998.

Wold H: Estimation of principal components and related models by iterative least squares. Multivariate Analysis. Edited by: Krishnaiaah PR. 1966, London: Academic Press, 391-420.

Garthwaite PH: An interpretation of partial least squares. J Am Stat Assoc. 1994, 89: 122-127. 10.2307/2291207.

Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002, 18: 39-50. 10.1093/bioinformatics/18.1.39.

Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P: 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology. 2000

Li H, Luan Y: Kernel Cox regression models for linking gene expression profiles to censored survival data. Pacific Symposium of Biocomputing. 2003, 8: 65-76.

Park PJ, Tian L, Kohane IS: Linking expression data with patient survival times using partial least squares. Bioinformatics. 2002, 18: 1625-1632. 10.1093/bioinformatics/18.12.1625.

Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, Hurt EM, Zhao H, Averett L, Yang L, Wilson WH, Jaffe ES, Simon R, Klausner RD, Powell J, Duffey PL, Longo DL, Greiner TC, Weisenburger DD, Sanger WG, Dave BJ, Lynch JC, Vose J, Armitage JO, Montserrat E, Lopez-Guillermo A, Grogan TM, Miller TP, LeBlanc M, Ott G, Kvaloy S, Delabie J, Holte H, Krajci P, Stokke T, Staudt LM, Project LMP: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New Engl J Med. 2002, 346: 1937-1947. 10.1056/NEJMoa012914.

Tadesse MG, Ibrahim JG, Gentleman R, Chiaretti S, Ritz J, Foa R: Bayesian error-in-variable model for the analysis of genechip arrays. Biometrics. 2005, 61: 488-497. 10.1111/j.1541-0420.2005.00313.x.

Beer DG, Kardia SLR, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JMG, Iannetton MD, Orringer MB, Hanash S: Gene-espression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine. 2002, 8: 816-824.

Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology. 2004, 2: 511-522. 10.1371/journal.pbio.0020108.

Li H, Luan Y: Boosting proportional hazards models using smoothing splines, with applications to high-dimensional microarray data. Bioinformatics. 2005, 21: 2403-2409. 10.1093/bioinformatics/bti324.

Sha N, Tadesse MG, Vannucci M: Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics. 2006, 22: 2262-2268. 10.1093/bioinformatics/btl362.

Tadesse MG, Sha N, Vannucci M: Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc. 2005, 100: 602-617. 10.1198/016214504000001565.

Kalbfleisch JD: Nonparametric Bayesian analysis of survival time data. J R Stat Soc Ser B. 1978, 40: 214-221.

Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics. 2001, 17: 520-525. 10.1093/bioinformatics/17.6.520.

De Hoon MJL, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics. 2004, 20: 1453-1454. 10.1093/bioinformatics/bth078.

Gui J, Li H: Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005, 21: 3001-3008. 10.1093/bioinformatics/bti422.

Ling V, Wu PW, Finnerty HF, Sharpe AH, Gray GS, Collins M: Complete sequence determination of the mouse and human CTLA4 gene loci: cross-species DNA sequence similarity beyond exon borders. Genomics. 1999, 60: 341-355. 10.1006/geno.1999.5930.

Sterpetti P, Hack AA, Bashar MP, Park B, Cheng SD, Knoll JHM, Urano T, Feig LA, Toksoz D: Activation of the Lbc Rho exchange factor proto-oncogene by truncation of an extended C terminus that regulates transformation and targeting. Molec Cell Biol. 1999, 19: 1334-1345.

Garcia-Zepeda EA, Combadiere C, Rothenberg ME, Sarafi MN, Lavigne F, Hamid Q, Murphy PM, Luster AD: Human monocyte chemoattractant protein (MCP)-4 is a novel CC chemokine with activities on monocytes, eosinophils, and basophils induced in allergic and nonallergic inflammation that signals through the CC chemokine receptors (CCR)-2 and -3. J Immunol. 1996, 157: 5613-5626.

Subramaniam M, Colvard D, Keeting PE, Rasmussen K, Riggs BL, Spelsberg TC: Glucocorticoid regulation of alkaline phosphatase, osteocalcin, and proto-oncogenes in normal human osteoblast-like cells. J Cell Biochem. 1992, 50: 411-424. 10.1002/jcb.240500410.

Hogervorst F, Kuikman I, von dem Borne AE, Sonnenberg A: Cloning and sequence analysis of beta-4 cDNA: an integrin subunit that contains a unique 118 kd cytoplasmic domain. EMBO J. 1990, 9: 765-770.

Campbell CJ, Ghazal P: Molecular signatures for diagnosis of infection: application of microarray technology. Journal of Applied Microbiology. 2003, 96: 18-23. 10.1046/j.1365-2672.2003.02112.x.

Mocellin S, Wang E, Panelli M, Pilati P, Marincola FM: DNA array-based gene profiling in tumor immunology. Clinical Cancer Research. 2004, 10: 4597-4606. 10.1158/1078-0432.CCR-04-0327.

Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set?. Bioinformatics. 2005, 2: 171-178.

Strausberg RL: Tumor microenvironments, the immune system and cancer survival. Genome Biology. 2005, 6: 211.1-211.4. 10.1186/gb-2005-6-3-211.

Henikoff S, Henikoff JG: Automated assembly of protein blocks for database searching. Nuclei Acids Research. 1991, 19: 6565-6572. 10.1093/nar/19.23.6565.

Jörnsten R, Yu B: Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics. 2002, 19: 1100-1109. 10.1093/bioinformatics/btg039.

Shaffer AL, Rosenwald A, Hurt EM, Giltnane JM, Lam LT, Pickeral OK, Staudt LM: Signatures of the immune response. Immunity. 2001, 15: 375-385. 10.1016/S1074-7613(01)00194-7.

Zhang M, Zhang D, Wells MT: Generalized shrinkage estimators adaptive to sparsity and asymmetry of high dimensional parameter spaces. Submitted. 2007

Zhang M, Montooth KL, Wells MT, Clark AG, Zhang D: Mapping multiple quantitative trait loci by Bayesian classification. Genetics. 2005, 169: 2305-2318. 10.1534/genetics.104.034181.

Zhang M, Zhang D, Wells MT: Variable selection for large

*p*small*n*regression models with incomplete data: mapping QTL with epistases. Submitted. 2007Zhang M: Inference for sparse and asymmetric signals in high dimensional data with applications to statistical genomics. PhD thesis. 2005, Cornell University, Department of Biological Statistics and Computational Biology

Cowles MK, Carlin BP: Markov Chain Monte Carlo convergence diagnostics: a comparative review. J Am Stat Assoc. 1996, 91: 883-904. 10.2307/2291683.

## Acknowledgements

Thanks to Derick Peterson for providing the 20 simulated datasets. This research was partially supported by a startup fund from the Department of Statistics of Purdue University, and a summer faculty grant from the Purdue Research Foundation. The authors thank the editor, and two referees for insightful comments that greatly improved the manuscript.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Authors' contributions

DZ and MZ both contributed to the development of the modeling method. DZ wrote the Matlab code and did the simulation study. MZ analyzed the real data. Both authors read and approved the final manuscript.

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Zhang, D., Zhang, M. Bayesian profiling of molecular signatures to predict event times.
*Theor Biol Med Model* **4**, 3 (2007). https://doi.org/10.1186/1742-4682-4-3

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1742-4682-4-3

### Keywords

- Event Time
- Bayesian Approach
- Molecular Feature
- Gibbs Sampler
- Linkage Cluster