An information transmission model for transcription factor binding at regulatory DNA sites
- Mingfeng Tan†^{1},
- Dong Yu†^{1},
- Yuan Jin†^{1},
- Lei Dou^{2},
- Beiping LI^{1},
- Yuelan Wang^{1},
- Junjie Yue^{1} (corresponding author) and
- Long Liang^{1} (corresponding author)
https://doi.org/10.1186/1742-4682-9-19
© Tan et al.; licensee BioMed Central Ltd. 2012
Received: 8 April 2012
Accepted: 17 May 2012
Published: 6 June 2012
Abstract
Background
Computational identification of transcription factor binding sites (TFBSs) is a rapid, cost-efficient way to locate unknown regulatory elements. With increased potential for high-throughput genome sequencing, the availability of accurate computational methods for TFBS prediction has never been as important as it currently is. To date, identifying TFBSs with high sensitivity and specificity is still an open challenge, necessitating the development of novel models for predicting transcription factor-binding regulatory DNA elements.
Results
Based on information theory, we propose a model for transcription factor binding at regulatory DNA sites. The model effectively incorporates position interdependencies. It computes the information transferred (TI) between the transcription factor and the TFBS during the binding process and uses TI as the criterion for deciding whether a sequence motif is a possible TFBS. Based on this model, we developed a computational method to identify TFBSs. Theoretical analysis and tests on both real and artificial data indicate that the model provides highly accurate predictions.
Conclusions
In this study, we present a novel model for transcription factor binding at regulatory DNA sites. The model can provide an increased ability to detect TFBSs.
Background
The transcription of genes is controlled by transcription factors (TFs), which bind to short DNA motifs that are known as transcription factor binding sites (TFBSs). Identification of TFBSs lies not only at the very heart of expanding our knowledge of regulatory elements in the genome by helping to decode genomic data, discover regulatory patterns in gene expression, and establish transcription regulatory networks, but also of explaining the origins of organismal complexity and development [1]. Computational identification of TFBSs is a rapid, cost-efficient way to locate unknown TFBSs. With increased potential for high-throughput genome sequencing, the availability of accurate computational methods for TFBS prediction has never been as important as it currently is. However, DNA regulatory elements are frequently short and variable, making the computational identification of them a challenging problem because the real TFBSs might be easily lost in random DNA sequences, i.e., the “background noise”.
To date, many models have been developed for transcription factor binding of regulatory DNA sites, and based on those models, numerous computational algorithms have been established to identify TFBSs. Several studies have utilised the structural information of DNA and protein to build predictive models for DNA binding sites [2–5]. These algorithms are able to identify previously uncharacterised binding sites for TFs and have improved performance over simple sequence profile models [6]. However, these algorithms have not been generally used because their parameters depend on the knowledge of the solved protein-DNA complex structures, which is a limited data set.
Several methods use pattern recognition algorithms derived from computer science or other research areas. These methods include support vector machines (SVMs) [7], self-organising maps (SOMs) [8], and Bayesian networks [9]. These algorithms can automatically provide objective, non-user-defined thresholds by training the programme with known data. Nevertheless, the biggest limitation of these methods might be the lack of an explicit biochemical or biophysical explanation.
Currently, the position weight matrix (PWM) is the most common model for TFBS recognition. Many methods or programmes are based on the PWM model or its extensions, such as Match [10], the expectation–maximisation (EM) algorithm [11], and the stochastic variant of EM, the Gibbs sampling method [12, 13]. In a PWM, a motif of length L is represented by a 4 × L matrix whose weights give the frequency (or the logarithm of the frequency) of each of the four DNA bases at each of the L positions [6, 14, 15]. The basic PWM model is based on biophysical considerations of protein–DNA interactions and uses the relative entropy, also known as the information content, as the criterion to determine whether an input sequence is a TFBS. According to this theory, the affinity between the factor and its TFBS is related to the free energy, which correlates with the relative entropy [6, 14, 15]. A sequence that is a TFBS is therefore expected to have a higher relative entropy, and the relative entropy can be used as the criterion to detect a TFBS.
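As a minimal sketch of the PWM idea described above, the following Python builds a log-odds matrix from aligned sites and scores a candidate sequence by summing the per-position weights. The toy sites, the uniform background and the pseudocount of 1 are illustrative assumptions and are not taken from any of the cited programmes.

```python
import math

BASES = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Return one dict per position holding log2(p_j(b) / q(b)) for each base b.

    The pseudocount and the uniform default background are assumptions
    made for this sketch.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    pwm = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background[b]) for b in BASES})
    return pwm

def pwm_score(pwm, seq):
    """Sum the position weights: this is where the independent hypothesis enters."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Hypothetical aligned sites, used only to exercise the functions.
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
pwm = build_pwm(sites)
print(pwm_score(pwm, "TGACTC"), pwm_score(pwm, "CCCGGG"))
```

Because the score is a plain sum over positions, any dependence between positions is invisible to it, which is the limitation the later sections address.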
The PWM approach assumes that the contribution of each nucleotide position within a TFBS to the free energy is independent and that the effect on the binding strength is cumulative. We call this hypothesis the “independent hypothesis” because it supposes that each base of the motif is independent of the others. Methods based on the independent hypothesis are simple and have small numbers of parameters, making them easy to implement. These methods are widely used and often considered acceptable models for binding-site predictions.
The PWM model can suffer from high false-positive (FP) rates if motifs are degenerate. In addition, in some real cases, the affinity between factors and their TFBSs is weak, causing a high false-negative (FN) rate while using these methods. More importantly, the independent hypothesis can lead to deviations in the scoring mechanism and produce inaccurate results. Experimental evidence [16–20] suggests that there is interdependence among positions in the binding sites, which has prompted the development of models that incorporate position dependencies. The related methods include Bayesian networks [21], permuted Markov models [22], Markov chain optimisation [23], hidden Markov models [24], non-parametric models [25], and generalised weight matrix models [26]. Methods based on position-dependency models usually have better binding site prediction accuracy with lower FP rates. However, these methods require more complicated mathematical tools with more parameters to estimate and more experimental data than are typically available [27].
Orthogonal information from comparative genomics and information on co-regulation at the transcriptional level have also been integrated into these methods to identify cis-regulatory sites [28–31]. Methods have also been proposed to discover the composite regulatory module (CMA) [32, 33]. Because most of these methods rely on the basic algorithms proposed previously, their performances are mainly determined by these basic algorithms.
Therefore, although significant progress has been made, the accuracy of the computational identification of TFBSs can still be improved. To tackle the general problem of binding site identification in the absence of high-throughput experimental data, theoretical models of binding sites are still required.
One aim of this work is to develop a new model that effectively incorporates position interdependencies to improve the computational prediction of TFBSs. In this study, we propose a novel computational model based on information theory [34, 35]. Through theoretical analysis and tests on both real and artificial data, we find that our model gives highly accurate predictive results.
Information transmission model
An information transfer model for TFBS binding
Suppose that a transcription factor and its known TFBSs are aligned by an appropriate algorithm. In this study, we use L to represent the length of the aligned motif, j to represent the base position and p_{ j }(i) to represent the occurrence probability that the base i (A, T, C or G) appears at the position j according to the motif.
If the TI of an input sequence is larger than the MTI, the sequence is accepted as a possible TFBS.
Enhancement of the model to be universal
The independent model might lead to inaccurate predictive results. In this section, we discuss in detail how this can happen by example and in theory and how we enhanced our model to be independent of this hypothesis.
Based on this observation, we propose a formal definition of positive and negative correlations of the bases in a motif: if $p_{seq(i_{1}),\cdots,seq(i_{m})} > p_{seq(i'_{1}),\cdots,seq(i'_{k})} \times p_{seq(i'_{k+1}),\cdots,seq(i'_{m})}$, then $seq(i'_{1}),\cdots,seq(i'_{k})$ and $seq(i'_{k+1}),\cdots,seq(i'_{m})$ are positively correlated. If the two sides are equal, then $seq(i'_{1}),\cdots,seq(i'_{k})$ and $seq(i'_{k+1}),\cdots,seq(i'_{m})$ are independent; otherwise, they are negatively correlated. In this formula, seq(i) is the i th base of the sequence seq. For example, the 4^{th} and 6^{th} bases are positively correlated: together they contain no more and no less information than either base alone. Therefore, the independent hypothesis leads to an inaccurate estimation of the TI and thereby to an erroneous prediction of the TFBS. Similarly, other methods that are based on the independent hypothesis also produce incorrect scores and inaccurate predictive results. To avoid this inaccuracy, the model was enhanced to address the correlations so that it can determine the correct TI even when the independent hypothesis does not hold.
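The definition above can be illustrated for the simplest case of two positions. The sketch below compares the joint frequency of a base pair with the product of the two single-base frequencies in a set of aligned sites. The function name, the toy sites and the tolerance `tol` are assumptions introduced for this example.

```python
def classify_pair(sites, j, k, base_j, base_k, tol=1e-9):
    """Classify the correlation between base_j at position j and base_k at position k.

    Positions are 0-based. Frequencies are estimated by simple counting
    over the aligned sites.
    """
    n = len(sites)
    p_j = sum(s[j] == base_j for s in sites) / n
    p_k = sum(s[k] == base_k for s in sites) / n
    p_jk = sum(s[j] == base_j and s[k] == base_k for s in sites) / n
    diff = p_jk - p_j * p_k
    if diff > tol:
        return "positive"
    if diff < -tol:
        return "negative"
    return "independent"

# Hypothetical motif in which the 4th and 6th bases always agree.
sites = ["ACGTAT", "ACGGAG", "ACGTAT", "ACGGAG"]
print(classify_pair(sites, 3, 5, "T", "T"))  # joint 0.5 vs product 0.25
print(classify_pair(sites, 3, 5, "T", "G"))  # joint 0.0 vs product 0.25
print(classify_pair(sites, 0, 3, "A", "T"))  # joint 0.5 vs product 0.5
```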
First, we know that after binding of the TF to the TFBS, the information encoded by the TFBS seq is $IS2 = I_{L} = -\log_{2} p_{seq} = -\log_{2} p_{seq(1),\cdots,seq(L)}$. In this equation, $p_{seq} = p_{seq(1),\cdots,seq(L)}$ is the occurrence probability of $seq = seq(1),\cdots,seq(L)$ among all of the TFBSs of the TF. Because some TFBSs are unknown and the known TFBSs provide only limited statistical data, we cannot determine p_{ seq } or I_{ L } directly, but these terms can be estimated from the known TFBSs.
We use the information of r-base sub-sequences $seq(i_{1}),\cdots,seq(i_{r})\ (i_{1} > i_{2} > \cdots > i_{r})$ to estimate the information of the full sequence. The probability of an r-base sub-sequence, $p_{seq(i_{1}),\cdots,seq(i_{r})}$, can be approximated as $\tilde{p}_{seq(i_{1}),\cdots,seq(i_{r})}$ by investigating the known TFBSs, and in the following steps, we assume that $p_{seq(i_{1}),\cdots,seq(i_{r})}$ and $\tilde{p}_{seq(i_{1}),\cdots,seq(i_{r})}$ are the same. Therefore, $-\log_{2} p_{seq(i_{1}),\cdots,seq(i_{r})}$ is the information of the sub-sequence.
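A minimal sketch of this estimation step is given below: for every choice of r positions, the observed base combinations in the known sites are counted and converted to frequencies, and the information of a sub-sequence is the negative base-2 logarithm of its frequency. The function names and the toy sites are assumptions; combinations that never occur in the known sites are simply absent from the table, and how to treat them is left open here.

```python
import math
from collections import Counter
from itertools import combinations

def subsequence_probabilities(sites, r):
    """Map (positions, bases) to the observed frequency among the aligned sites.

    `positions` is a tuple of r 0-based indices (enumerated in increasing
    order; only the choice of positions matters) and `bases` is the string
    of bases found at those positions.
    """
    n = len(sites)
    probs = {}
    for positions in combinations(range(len(sites[0])), r):
        counts = Counter("".join(s[i] for i in positions) for s in sites)
        for bases, count in counts.items():
            probs[(positions, bases)] = count / n
    return probs

def subsequence_information(probs, positions, bases):
    """Information, in bits, of one r-base sub-sequence."""
    return -math.log2(probs[(positions, bases)])

# Hypothetical aligned sites.
sites = ["ACGT", "ACGA", "ACCT", "TCGT"]
probs = subsequence_probabilities(sites, r=2)
print(probs[((0, 1), "AC")])                        # 3 of the 4 sites start with AC
print(subsequence_information(probs, (0, 1), "AC"))
```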
This probability can reveal the correlation among these bases. For example, if base i_{1} is fully and positively correlated with base i_{ k }, as the 4^{th} and 6^{th} positions in Figure 2 are, then $p_{seq(i_{1})seq(i_{k})} = p_{seq(i_{1})} = p_{seq(i_{k})} > p_{seq(i_{1})}\, p_{seq(i_{k})}$; hence, the information of these two bases is $I_{i_{1},i_{k}} = -\log_{2} p_{seq(i_{1})} = \frac{1}{2} I_{independent}$ (i.e., it is only half of the information of the independent situation). If base i_{1} is independent of base i_{ k }, then $p_{seq(i_{1})seq(i_{k})} = p_{seq(i_{1})} \times p_{seq(i_{k})}$ and $I_{i_{1},i_{k}} = -\log_{2}\left(p_{seq(i_{1})} \times p_{seq(i_{k})}\right) = I_{independent}$.
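As a worked numerical instance of the two cases above, assume (purely for illustration) that each of the two bases occurs with probability 1/4 at its position:

```latex
\begin{align*}
\text{fully correlated:}\quad
  & p_{seq(i_1)seq(i_k)} = p_{seq(i_1)} = \tfrac{1}{4},
  & I_{i_1,i_k} &= -\log_2 \tfrac{1}{4} = 2\ \text{bits},\\
\text{independent:}\quad
  & p_{seq(i_1)seq(i_k)} = \tfrac{1}{4}\times\tfrac{1}{4} = \tfrac{1}{16},
  & I_{independent} &= -\log_2 \tfrac{1}{16} = 4\ \text{bits}.
\end{align*}
```

The fully correlated pair therefore carries 2 bits, half of the 4 bits that the independent hypothesis would assign to it.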
If we assume that ${I}_{r+1}\ge {I}_{r}$, then we immediately obtain ${I}_{r+1}\ge {I}_{1}$, which contradicts (11). Therefore, it must be the case that ${I}_{r+1}<{I}_{r}$. Hence, (10) is proved.
Conversely, the tendency of I_{ r } can be used to judge whether the bases are positively correlated, negatively correlated, or independent. We now know that, if the independent hypothesis is not true, I_{ independent } overestimates I_{ L } when the correlation is positive and underestimates it when the correlation is negative. This finding again explains why the independent hypothesis can lead to inaccurate predictions. More importantly, from (13) and (15), we know that I_{ r } is more accurate than I_{ independent } when r ≥ 2. We can therefore use I_{ r } (r ≥ 2) to estimate the information and obtain more accurate predictive results.
In this equation, ${q}_{seq\left({i}_{1}\right),\cdots ,seq\left({i}_{r}\right)}$ is the background correlation probability, calculated as described previously.
If TI_{ r } ≥ MTI_{ r } (factor), then seq is accepted as a possible TFBS.
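The decision rule can be sketched as follows. The equations that define TI_{ r } are not reproduced in this text, so the formula used below is an assumption: the log-ratio of motif probability p to background probability q is summed over all r-position combinations and divided by C(L-1, r-1), the number of combinations that contain any given position, so that r = 1 reduces to the independent sum. The function names, the `floor` value for unseen combinations and the table layouts are likewise hypothetical.

```python
import math
from itertools import combinations

def transferred_information(seq, motif_probs, background_probs, r, floor=1e-6):
    """Assumed TI_r: normalised sum of log2(p / q) over all r-position combinations.

    motif_probs maps (positions, bases) to p; background_probs maps an
    r-base string to q. Combinations unseen in the motif get `floor`
    (an assumption of this sketch).
    """
    total = 0.0
    for positions in combinations(range(len(seq)), r):
        bases = "".join(seq[i] for i in positions)
        p = motif_probs.get((positions, bases), floor)
        q = background_probs[bases]
        total += math.log2(p / q)
    return total / math.comb(len(seq) - 1, r - 1)

def is_possible_tfbs(seq, motif_probs, background_probs, r, mti):
    """Accept seq when the assumed TI_r reaches the threshold MTI_r."""
    return transferred_information(seq, motif_probs, background_probs, r) >= mti

# Toy two-base motif: the pair AC is always observed in the motif and
# occurs with probability 1/16 in the background.
ti = transferred_information("AC", {((0, 1), "AC"): 1.0}, {"AC": 1 / 16}, r=2)
print(ti)
```

The threshold `mti` is passed in by the caller; how MTI_{ r } is derived for a given factor is not specified by this sketch.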
Results
Performance in Saccharomyces cerevisiae promoter regions
We tested our model by calculating the TI for all of the known TFBSs of 10 well-characterised transcription factors in the yeast S. cerevisiae promoter database (SCPD) [36]. We found that most of the TFBSs have a TI larger than 0. This evidence strongly supports our TI hypothesis that the information is transferred from the TFBS to the factor, and binding of the TF to the TFBS only happens if enough information is transferred.
In this study, we illustrate several snapshots of TI by scanning several sequences of S. cerevisiae. These sequences cover the coding regions, the regulatory regions and the “flank” regions.
A comparison of the TI model with other methods
To illustrate the performance of the information transmission model, we implemented it in a programme named tfbsInfoScanner and compared it with commonly used motif identification programmes: SOMBRERO, MEME and AlignACE. Mahony et al. [8] proposed the TFBS prediction method SOMBRERO and compared its results with those from two popular motif-finding programmes, MEME [37] and AlignACE [13]. To analyse the performance of our method efficiently and to avoid repetitive and time-consuming computation, we used the same real sequence data set as Mahony et al. and compared the results derived from our method with those obtained from SOMBRERO, MEME and AlignACE.
Performance comparison between our TI method (r = 3) and three other programmes: SOMBRERO, MEME and AlignACE
Method | Index | abf1 | csre | gal4 | gcn4 | gcr1 | hstf | mat | mcb | mig1 | pho2 |
---|---|---|---|---|---|---|---|---|---|---|---|
SOMBRERO | FP | 0.56 | 0.727 | 0.235 | 0.286 | 0.69 | 0.571 | 0.25 | 0.645 | 0.68 | 0.909 |
SOMBRERO | FN | 0.45 | 0.25 | 0.071 | 0.6 | 0.222 | 0.111 | 0.308 | 0.083 | 0.2 | 0.5 |
MEME | FP | 0.182 | 0.667 | 0.167 | 0.8 | 0.444 | 0.75 | 0.267 | 0.25 | 1 | 1 |
MEME | FN | 0.55 | 0.5 | 0.286 | 0.92 | 0.444 | 0.333 | 0.154 | 0.25 | 1 | 1 |
AlignACE | FP | 0.375 | 0.824 | 0.083 | 0.444 | 0.625 | 0.556 | 0 | 0.083 | 0.909 | 1 |
AlignACE | FN | 0.5 | 0.25 | 0.214 | 0.6 | 0.333 | 0.111 | 0.308 | 0.083 | 0.9 | 1 |
TI model with 25% known TFBS as training set | FP | 0 | 0 | 0 | 0.182 | 0.333 | 0 | 0 | 0 | 0 | 0 |
TI model with 25% known TFBS as training set | FN | 0.727 | 0.5 | 0.643 | 0.259 | 0.692 | 0.667 | 0.526 | 0.333 | 0.429 | 0.5 |
TI model with 50% known TFBS as training set | FP | 0 | 0.333 | 0 | 0.226 | 0.143 | 0 | 0.294 | 0 | 0 | 0 |
TI model with 50% known TFBS as training set | FN | 0.455 | 0 | 0.286 | 0.037 | 0.308 | 0.5 | 0.158 | 0.083 | 0.214 | 0.375 |
TI model with 75% known TFBS as training set | FP | 0 | 0 | 0 | 0.25 | 0.615 | 0.783 | 0.25 | 0 | 0 | 0.25 |
TI model with 75% known TFBS as training set | FN | 0.182 | 0 | 0.143 | 0.037 | 0.077 | 0 | 0.158 | 0 | 0.143 | 0.125 |
TI model with 100% known TFBS as training set | FP | 0 | 0 | 0 | 0.265 | 0.577 | 0.526 | 0.222 | 0 | 0 | 0.571 |
TI model with 100% known TFBS as training set | FN | 0 | 0 | 0 | 0 | 0 | 0.167 | 0.053 | 0 | 0.071 | 0 |
Performance on artificial sequences
To examine the performance of our method in discovering “unknown” TFBSs, we subsequently trained our method with all of the known TFBSs and embedded pseudo-motifs in artificial sequences. Similar to Mahony et al. [8], we generated three artificial test sets, although using our own method. In the artificial test sets used by Mahony et al., each set comprises 10 data sets, each of which comprises 10 sequences; each sequence harbours a random number of occurrences (0 ~ 3) for each of the binding motifs for gcn4, gal4 and mat1 (generated from PWMs). The total lengths of these three sets of 100 sequences are 4500, 8000 and 12500 bp, respectively. The average length of one sequence is therefore only 45, 80 or 125 bp, yet each sequence can harbour up to 9 occurrences of the motifs. We believe that this density of occurrences may be too high and that these sequences may also encode a high number of pseudo-TFBSs.
In our modified method, we also generated three artificial test sets with different sequence lengths (450, 800 and 1250 bp); each test set consists of 10 sequences that were randomly generated according to the GC content of S. cerevisiae. Each sequence harbours a random number of occurrences (0 ~ 3) for each of the binding motifs for gcn4, gal4 and mcb (randomly generated from PWMs). Mahony et al. [8] used mat1 as a test object, but in the new version of SCPD, the TFBS of mat1 is split into mat1_alpha and mat1_beta; therefore, we arbitrarily chose mcb as a substitute for mat1. This test set is more rigorous because the artificial sequences are 10 times longer, which increases the amount of random sequence and may result in a higher FP rate. As our method is still under development, the pseudo-TFBSs in this test are also generated from the PWMs. Because the PWM method assumes that the independent hypothesis is true, these pseudo-TFBSs cannot correctly reflect the correlation among the bases. This deficiency might lead to a lower TI, and, therefore, some pseudo-TFBSs may not be identified by our method. Nevertheless, we can investigate what happens when these artificial sequences are scanned.
Average performance of the artificial sequence data set (r = 3), perf = |K∩P|/|K∪P|, where K is the set of known motif sites and P is the set of predicted motif sites [30]
Length | Index | gcn4 | gal4 | mcb | Average |
---|---|---|---|---|---|
450*10 | FP | 0.75 | 0 | 0.647 | 0.466 |
450*10 | FN | 0.083 | 1 | 0.143 | 0.409 |
450*10 | perf | 0.244 | 0 | 0.333 | 0.192 |
450*10 | PT/RT | 3.667 | 0 | 2.429 | 2.032 |
800*10 | FP | 0.892 | 0 | 0.75 | 0.547 |
800*10 | FN | 0.2 | 1 | 0.333 | 0.511 |
800*10 | perf | 0.105 | 0 | 0.222 | 0.109 |
800*10 | PT/RT | 7.4 | 0 | 2.667 | 3.356 |
1250*10 | FP | 0.936 | 0 | 0.756 | 0.564 |
1250*10 | FN | 0.25 | 0.875 | 0.286 | 0.470 |
1250*10 | perf | 0.063 | 0.143 | 0.222 | 0.143 |
1250*10 | PT/RT | 11.75 | 0.143 | 2.929 | 4.941 |
Average performance of the artificial sequence data set (r = 1)
Length | Index | gcn4 | gal4 | mcb | Average |
---|---|---|---|---|---|
450*10 | FP | 0.894 | 0.308 | 0.917 | 0.706 |
450*10 | FN | 0 | 0 | 0 | 0 |
450*10 | perf | 0.098 | 0.692 | 0.073 | 0.288 |
450*10 | PT/RT | 10.25 | 1.444 | 13.714 | 8.469 |
800*10 | FP | 0.953 | 0.176 | 0.943 | 0.691 |
800*10 | FN | 0 | 0 | 0 | 0 |
800*10 | perf | 0.047 | 0.824 | 0.057 | 0.309 |
800*10 | PT/RT | 21.1 | 1.214 | 17.583 | 13.299 |
1250*10 | FP | 0.975 | 0.125 | 0.951 | 0.684 |
1250*10 | FN | 0.125 | 0 | 0 | 0.042 |
1250*10 | perf | 0.024 | 0.875 | 0.049 | 0.316 |
1250*10 | PT/RT | 35.625 | 1.143 | 20.286 | 19.018 |
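The indices in the artificial-sequence tables can be computed from the sets K (known sites) and P (predicted sites) as sketched below. Only perf is defined in the text; the definitions of FP, FN and PT/RT used here are assumptions, chosen because they are mutually consistent with the reported gcn4 values for 450*10 at r = 3. The site counts in the example (12 known, 44 predicted, 11 shared) are hypothetical values that reproduce those ratios, and sites are represented by plain identifiers rather than by overlapping coordinates.

```python
def evaluate(known, predicted):
    """Compute FP, FN, perf and PT/RT from sets of site identifiers.

    Assumed definitions: FP = |P - K| / |P|, FN = |K - P| / |K|,
    perf = |K & P| / |K | P|, PT/RT = |P| / |K|.
    """
    known, predicted = set(known), set(predicted)
    union = known | predicted
    return {
        "FP": len(predicted - known) / len(predicted) if predicted else 0.0,
        "FN": len(known - predicted) / len(known) if known else 0.0,
        "perf": len(known & predicted) / len(union) if union else 0.0,
        "PT/RT": len(predicted) / len(known) if known else 0.0,
    }

# Hypothetical counts: 12 known sites, 44 predictions, 11 of them correct.
known = set(range(12))
predicted = set(range(1, 45))
metrics = evaluate(known, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```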
Discussion
During evolution, regulatory instructions or information were encoded in the DNA sequence. Redundant coding (or correlated coding) is utilised to ensure that the important regulatory information is inherited and transferred correctly. During the binding process, the transcription factor reads the regulatory instructions from the TFBS and subsequently guides transcription according to this regulatory information. Because the nucleotides of a TFBS need to be coded in a redundant manner for the regulatory information to be transferred correctly, the positions within these sites are not independent of one another.
With our model, for the sequences encoding motifs, such as TFBSs, the input sequences can be scanned, and the sub-sequences for which the TI is greater than the MTI of the motif can be taken as the predictive hits.
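The scanning procedure described above amounts to a sliding window with a threshold. In the sketch below, the scoring function is passed in as a parameter so that any TI implementation can be plugged in; the match-counting score used in the example is only a stand-in, and treating a score equal to the threshold as a hit is a choice made here.

```python
def scan(sequence, motif_length, score, threshold):
    """Return (start, window, score) for every window whose score reaches the threshold."""
    hits = []
    for start in range(len(sequence) - motif_length + 1):
        window = sequence[start:start + motif_length]
        value = score(window)
        if value >= threshold:
            hits.append((start, window, value))
    return hits

# Stand-in score: number of matches to a hypothetical consensus.
CONSENSUS = "TGACTC"

def match_count(window):
    return sum(a == b for a, b in zip(window, CONSENSUS))

hits = scan("AATGACTCAA", len(CONSENSUS), match_count, threshold=6)
print(hits)
```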
In our observations, most of the real TFBSs showed positive correlations: with positively correlated coding, the information that the sites contain decreases accordingly, but the information is transferred correctly.
Interestingly, we find that if there is a real TFBS encoded by one strand, then there are often peaks on both strands, although the peaks on the opposite strand are usually lower. We think this phenomenon happens for two reasons. First, certain factors bind to their TFBS by inserting a domain into the DNA grooves. In this case, both strands of DNA could have physical contact with the transcription factor; hence, both sides could transfer the regulatory information to the factor, which is detected by our method. Second, it is not known from which strand the background noise comes; for example, for r = 2, the occurrence probability of AG equals that of TC. Therefore, the complementary strand of a real TFBS can also have a high TI.
Furthermore, this information transmission model has the potential to be useful in other research areas, for example, in the computational identification of other motifs.
Concluding remarks
In this work, we present a novel model for transcription factor binding at regulatory DNA sites. This information transmission model is based on information theory and effectively incorporates position interdependencies. By testing the model on both real and artificial data sets, we have illustrated that our method is efficient at predicting unknown TFBSs.
Materials and methods
Data set preparation
The TFBSs of the 11 TFs and regulatory region sequences were obtained from the yeast S. cerevisiae Promoter Database (SCPD, http://rulai.cshl.edu/SCPD) [36]. This data set includes 68 regulatory regions with a total length of 30299 bp. These sequences harbour 309 experimentally mapped TFBSs, including 141 real TFBSs of the 11 TFs. The chromosome sequences of S. cerevisiae were obtained from the National Center for Biotechnology Information (NCBI) reference sequence database.
The artificial sequences used in the test were randomly generated, taking into account the GC content of the S. cerevisiae genome. The pseudo-TFBSs of gcn4, gal4 and mcb were randomly generated from PWMs. We did not generate the correlated TFBSs directly because it is difficult to make the pseudo-TFBSs conform to the correlation relationships, as real TFBSs do.
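A minimal sketch of this generation procedure is shown below: a background sequence is drawn base by base according to a GC content, pseudo-sites are sampled column by column from a PWM of base probabilities, and up to three sites are planted at random positions. The GC content of 0.38, the toy PWM, the fixed seed and the rule of skipping overlapping placements are assumptions made for illustration; the source does not state these details.

```python
import random

BASES = "ACGT"

def random_background(length, gc_content, rng):
    """Draw a random sequence with the given expected GC content."""
    at, gc = (1.0 - gc_content) / 2, gc_content / 2
    return "".join(rng.choices(BASES, weights=[at, gc, gc, at], k=length))

def sample_site(pwm, rng):
    """Sample one pseudo-TFBS, treating every PWM column independently."""
    return "".join(rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pwm)

def make_test_sequence(length, gc_content, pwm, rng, max_sites=3):
    """Plant between 0 and max_sites pseudo-TFBSs in a random background."""
    seq = random_background(length, gc_content, rng)
    planted = []
    for _ in range(rng.randint(0, max_sites)):
        site = sample_site(pwm, rng)
        start = rng.randrange(length - len(site) + 1)
        if any(abs(start - s) < len(site) for s, _ in planted):
            continue  # skip overlapping placements, so fewer sites may be planted
        seq = seq[:start] + site + seq[start + len(site):]
        planted.append((start, site))
    return seq, planted

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical PWM that strongly favours the consensus TGACTC.
toy_pwm = [{b: (0.97 if b == c else 0.01) for b in BASES} for c in "TGACTC"]
seq, planted = make_test_sequence(450, 0.38, toy_pwm, rng)
print(len(seq), planted)
```

Because each column is sampled independently, sites produced this way carry no positional correlation, which is the limitation noted in the text.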
Background probabilities calculation
Background probabilities are used to estimate the information carried by the TFBS before the binding event. An L-base window slides through the chromosomes, and all of the r-base sub-sequences in this window are counted. After the scanning, 4^{r} probabilities are calculated, one for each of the 4^{r} possible r-base sub-sequences. This computation is time-consuming, but once the background probabilities have been worked out, they can be reused in all of the TFBS predictions for this species without being recalculated.
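One possible reading of this procedure is sketched below: every L-base window is visited, every choice of r positions inside the window contributes one r-base string, and the pooled counts are normalised to probabilities. Whether the sub-sequences are pooled over all position combinations in exactly this way is an assumption, as are the function name and the toy input; characters other than A, C, G and T are not handled.

```python
from collections import Counter
from itertools import combinations

def background_probabilities(chromosomes, window, r):
    """Estimate the probability of each r-base string seen in sliding windows."""
    counts = Counter()
    for chrom in chromosomes:
        for start in range(len(chrom) - window + 1):
            w = chrom[start:start + window]
            # Every combination of r positions inside the window is counted.
            for positions in combinations(range(window), r):
                counts["".join(w[i] for i in positions)] += 1
    total = sum(counts.values())
    return {bases: count / total for bases, count in counts.items()}

# Toy "chromosome" used only to exercise the function.
background = background_probabilities(["ACGTACGT"], window=4, r=2)
print(len(background), sum(background.values()))
```

The cost grows with the number of windows times C(L, r), which is consistent with the remark that the computation is time-consuming and worth caching per species.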
Sequence alignment
The TFBSs of each TF were aligned separately by the ClustalW multiple alignment programme with the default parameters, and the aligned TFBSs and the background probabilities were used to calculate the MTI.
Computation environment
The novel method was implemented with a programme named tfbsInfoScaner, which was written in standard C++. This programme can be run on different computer platforms, and the full source code is available free for non-commercial use upon request by contacting the authors. Our test was run on a 64-CPU Altix 3700 server (Silicon Graphics, Mountain View, CA).
References
1. GuhaThakurta D: Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res. 2006, 34: 3585-3598.
2. Kono H, Sarai A: Structure-based prediction of DNA target sites by regulatory proteins. Proteins. 1999, 35: 114-131.
3. Steffen NR, Murphy SD, Tolleri L, Hatfield GW, Lathrop RH: DNA sequence and structure: direct and indirect recognition in protein-DNA binding. Bioinformatics. 2002, 18: S22-S30.
4. Morozov AV, Havranek JJ, Baker D, Siggia ED: Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005, 33: 5781-5798.
5. Siggers TW, Honig B: Structure-based prediction of C2H2 zinc-finger binding specificity: sensitivity to docking geometry. Nucleic Acids Res. 2007, 35: 1085-1097.
6. Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987, 193: 723-750.
7. Djordjevic M, Sengupta AM, Shraiman BI: A biophysical approach to transcription factor binding site discovery. Genome Res. 2003, 13: 2381-2390.
8. Mahony S, Hendrix D, Golden A, Rokhsar DS: Transcription factor binding site identification using the self-organizing map. Bioinformatics. 2005, 21: 1807-1814.
9. Makita Y, De Hoon MJ, Ogasawara N, Miyano S, Nakai K: Bayesian joint prediction of associated transcription factors in Bacillus subtilis. Pac Symp Biocomput. 2005, 10: 507-518.
10. Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, et al: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003, 31: 3576-3579.
11. Cardon LR, Stormo GD, et al: Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J Mol Biol. 1992, 223: 159-170.
12. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, et al: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214.
13. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205-1214.
14. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information content of binding sites on nucleotide sequences. J Mol Biol. 1986, 188: 415-431.
15. Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci. 1998, 23: 109-113.
16. Benos PV, et al: Probabilistic code for DNA recognition by proteins of the EGR family. J Mol Biol. 2002, 323: 701-727.
17. Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002, 3: 1255-1261.
18. Man T-K, Stormo GD: Non-independence of Mnt repressor–operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 2001, 29: 2471-2478.
19. Udalova IA, et al: Quantitative prediction of NF-kappa B DNA-protein interactions. Proc Natl Acad Sci USA. 2002, 99: 8167-8172.
20. Wolfe SA, et al: Analysis of zinc fingers optimized via phage display: evaluating the utility of a recognition code. J Mol Biol. 1999, 285: 1917-1934.
21. Barash Y, et al: Modeling dependencies in protein-DNA binding sites. Proceedings of RECOMB-03. 2003, 28-37.
22. Zhao X, et al: Finding short DNA motifs using permuted Markov models. J Comput Biol. 2005, 12: 894-906.
23. Ellrott K, et al: Identifying transcription factor binding sites through Markov chain optimization. Bioinformatics. 2002, 18 (Suppl. 2): S100-S109.
24. Marinescu VD, et al: MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes. BMC Bioinforma. 2005, 6: 79.
25. King OD, Roth FP: A non-parametric model for transcription factor binding sites. Nucleic Acids Res. 2003, 31: e116.
26. Zhou Q, Liu JS: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics. 2004, 20: 909-916.
27. Tomovic A, Oakeley EJ: Position dependencies in transcription factor binding sites. Bioinformatics. 2007, 23: 933-941.
28. Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nature Genet. 2001, 27: 167-171.
29. Cooper GM, Sidow A: Genomic regulatory regions: insights from comparative sequence analysis. Curr Opin Genet Dev. 2003, 13: 604-610.
30. Defrance M, Touzet H: Predicting transcription factor binding sites using local over-representation and comparative genomics. BMC Bioinforma. 2006, 7: 396.
31. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, et al: Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res. 2006, 16: 656-668.
32. Aerts S, Van Loo P, Thijs G, Moreau Y, De Moor B: Computational detection of cis-regulatory modules. Bioinformatics. 2003, 19: II5-II14.
33. Jegga AG, Gupta A, Gowrisankar S, Deshmukh MA, Connolly S, et al: CisMols analyzer: identification of compositionally similar cis-element clusters in ortholog conserved regions of coordinately expressed genes. Nucleic Acids Res. 2005, 33: W408-W411.
34. Shannon CE: A mathematical theory of communication (Part 1). Bell System Technical Journal. 1948, 27: 379-423.
35. Shannon CE: A mathematical theory of communication (Part 2). Bell System Technical Journal. 1948, 27: 623-656.
36. Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999, 15: 607-611.
37. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994, 2: 28-36.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.