ChIP-PaM: an algorithm to identify protein-DNA interaction using ChIP-Seq data
© Wu et al; licensee BioMed Central Ltd. 2010
Received: 19 March 2010
Accepted: 3 June 2010
Published: 3 June 2010
ChIP-Seq is a powerful tool for identifying the interaction between genomic regulators and their bound DNAs, especially for locating transcription factor binding sites. However, high cost and high rate of false discovery of transcription factor binding sites identified from ChIP-Seq data significantly limit its application.
Here we report a new algorithm, ChIP-PaM, for identifying transcription factor target regions in ChIP-Seq datasets. This algorithm makes full use of a protein-DNA binding pattern by capitalizing on three lines of evidence: 1) the tag count modelling at the peak position, 2) pa ttern m atching of a specific tag count distribution, and 3) motif searching along the genome. A novel data-based two-step eFDR procedure is proposed to integrate the three lines of evidence to determine significantly enriched regions. Our algorithm requires no technical controls and efficiently discriminates falsely enriched regions from regions enriched by true transcription factor (TF) binding on the basis of ChIP-Seq data only. An analysis of real genomic data is presented to demonstrate our method.
In a comparison with other existing methods, we found that our algorithm provides more accurate binding site discovery while maintaining comparable statistical power.
Understanding of transcriptional regulation mechanisms is of fundamental importance to the study of biological processes such as development, drug response and disease pathogenesis . Through modulation of gene expression patterns, the differentiation and function of cells are tightly controlled. The on/off switch of specific gene expression is one of the main modulating mechanisms and is mainly through the association and disassociation of transcription factors (TFs) with their target gene promoters. Therefore, revealing the mechanism by which transcription factors regulate their target genes is essential to understanding many important biological processes. Several methods have been developed to identify the TF-target gene interactions and to investigate how and why cells respond to different signals. One such method, chromatin immunoprecipitation (ChIP) on a chip (ChIP-chip), is based on a tiling-array platform in which genomic DNA oligomers from gene promoters are pre-fixed. The DNA fragments immuno-precipitated from cell lysate by a TF antibody hybridize with the ChIP-chip array and TF-binding regions are identified by their high-intensity signals. Like all other array-based methods, however, this method can detect only targets included on the array.
More recently, with the advance of next-generation sequencing (NGS) technologies, ChIP-Seq has come into a wide use for transcription factor binding sites analysis. By directly sequencing the DNA fragments immunoprecipitated in a ChIP experiments, ChIP-Seq offers whole-genome coverage and greater sensitivity than the traditional ChIP-chip assay . Several analytic algorithms have been proposed for ChIP-Seq data, including ERANGE , FindPeaks , MACS, SISSRs, CisGenome , QuEST , Useq , SPP , PeakSeq, BayesPeak , and GLITR . Most of these algorithms aim to identify genomic regions enriched with ChIP-DNA fragments by using some negative control samples to remove/normalize some of the background noise from experimental procedures. For example, Robertson et al identified the enriched regions by detecting peaks on a tag density map, generated by extending each mapped tag in the 3' direction to the average length of the DNA fragments in the sequenced DNA library. The signal map is the integer count of the number of overlapping DNA fragments at each nucleotide position. The control sample was generated by another ChIP-Seq experiment using the same antibodies against the un-stimulated cells, in which the transcription factor of interest is inactive and located in cytoplasm. Rozowsky et al used the same idea of signal map, but used the raw input DNA as the control sample. Chen et al, studied a group of 13 transcription factors in E14 mouse ES cells by using a control sample obtained from another ChIP-Seq experiment with an irrelevant antibody, anti-GFP.
Although these negative controls are useful, they cannot account for an important source of noise signal - nonspecific DNA binding by TFs. This noise signal is difficult to control, as TFs must nonspecifically bind to DNA in order to efficiently access their unique binding sites among billions of nucleotides. Studies directly probing transcription factor dynamics at the single-molecule level in a living cell showed that TFs spend as much as 90% of their time non-specifically bound to and diffusing along DNAs . Like the ability to bind to specific targets, non-specific binding to DNA is a bona fide TF ability and therefore, this type of noise signal cannot be eliminated by using a negative control. A few algorithms, such as SISSRs, MACS and FindPeaks, can identify transcription factor binding targets solely on the basis of ChIP-Seq data, without the use of control samples. However, with the exception of SISSRs, these algorithms identify binding sites merely by the number of tag counts within a genomic region, ignoring the forward- and reverse- strand information. SISSRs utilizes a logic rule in which the sequenced forward and reverse strands should lie separately on the two sides of the binding site; therefore, the difference between the forward and reverse tag counts would change sign on the binding site. The SISSRs algorithm usually generates significantly more binding sites than other algorithms , but these may include many false discoveries, as will be shown in later section.
Here we describe a new algorithm that incorporates the forward- and reverse-strand information but employs it differently. Our algorithm, ChIP-PaM, is based on peak counts modeling and pattern matching of a specific tag count distribution of forward and reverse strands generated by protein-DNA binding, followed by de novo motif finding and searching within the potential binding regions. We show that our algorithm can greatly reduce false positive findings while maintaining or improving accuracy and statistical power for binding site discovery.
ChIP-PaM: Scoring the Enriched Genomic Regions in ChIP-Seq Data
- 1.Identify potential binding regions (PBRs) by using a pre-specified empirical False Discovery Rate (eFDR): The whole genome is divided into non-overlapping regions d bp in size and the unique tags in each region are counted. The frequencies of the tag counts are then tabulated and fitted to a Gamma-Poisson (G-P) mixture model that can accommodate the over-dispersion of the data. The G-P model showed better fitting to the real data than the frequently used Poisson model and has been used in other software (e.g., Cis-Genome ). The eFDRs for different tag count thresholds, defined as the ratio of the theoretical number of regions exceeding the thresholds by chance from the G-P model to the number of observed regions exceeding the same thresholds , can be calculated as the following
where k is a count threshold, Pr(count > k) is the probability of a tag count exceeding k in the G-P background model, Ntotal is the total number of regions in the whole genome and Nobserved(count > k) is the number of observed regions with count exceeding k. A tag count cut-off corresponding to the pre-specified FDR rate (α, e.g. 0.5) is then determined, and the genomic regions with tag counts greater than the cutoff are selected, and merged if adjacent, to form PBRs. The majority of the background tag signals are eliminated in this step, which can save immense amount of computing time in the following steps.
Evaluate the peak tag counts for each PBR: A sliding window of width d is used to scan the PBRs base pair by base pair and the forward and reverse tags within each window are counted to yield the forward and reverse tag distributions similar to Figure 1A and 1C. On the basis of the fitted G-P model, a p-value is calculated for the tag counts at the peak position within each PBR.
Identify the peak shift pattern from forward- to reverse- strand tag distributions: If a PBR contains a true TFBS, the difference between the forward- and reverse- strand tag counts will show a sinusoidal shape (Figure 1B, D). This shape is used for the pattern match and is identified by pattern recognition that employs a wavelet-based smoothing technique (see Methods for details). Dissimilarity scores comparing each PBR to a simulated reference pattern (SR) are computed and ranked.
De novo motif finding: PBRs with high peak counts and good sinusoidal pattern of forward to reverse tag count shift are considered as high-quality PBRs, i.e., those most likely contain a TF binding site. The peak positions of the forward- and reverse-strand count distribution within the high-quality PBRs are obtained, and the genomic sequence a few bp (e.g., 20 bp) upstream and downstream of the peak sites are retrieved to search for the de novo motif to which the TF might bind. Existing efficient algorithms such as MEME  are used for motif finding. A motif search algorithm typically generates several motifs, but only those contained by at least 25% of the input sequences are candidate motifs. Notice that each input sequence is only about 40 bp long; therefore it is reasonable to assume that each sequence contains either one motif or no motif.
Scan potential PBRs by using the de novo motif identified: A scoring matrix formed on the aligned consensus sequences identified in step4 is used to screen and score all PBRs identified in step 1. The smallest p-values corresponding to the best match to the scoring matrix in each PBR are retained as the PBR motif p-values.
Determine the significant regions: The PBR peak tag count p-values, pattern dissimilarity scores, and motif p values are integrated to re-rank the PBRs. Because the minimal p-value for the peak tag count may be as low as 10-30, whereas the minimal motif p values may be only 10-5, the scale difference for different score distributions is normalized. The p-values are log-transformed, rescaled to the same level and averaged to re-rank the PBRs. Finally, the top (1- α) re-ordered PBRs are selected as the significant TF target regions.
Characteristics of the ChIP-Seq data
To illustrate our method, we used a public ChIP-seq dataset (GSE12782) deposited at the GEO by Rozowsky et al. This dataset was originally analyzed by using PeakSeq. Several characteristics of the dataset are worth noting: 1) The raw input DNA is used as control; 2) The experiments were done in female Hela S3 cells; 3) The mitochondrial chromosomes (Chr M) were retained for sequencing; and 4) the transcription factor studied was STAT1, a TF with many well-known downstream target genes. These characteristics, while are not necessarily useful for the data analysis, provide a good opportunity to assess our proposed method, including an estimate of its false negative and false positive findings, as described below.
Summary statistics of the ChIP-Seq dataset.
Comparison of nuclear and cytoplasmic chromosomes.
noise) in ChIP*
In contrast to Chr M, chromosome Y (Chr Y) is another extreme with very low coverage, because the male Y chromosome is absent in the female Hela S3 cells. Therefore, any enriched regions identified on the Chr Y should be considered as false positives. The fact that some reads were mapped to Chr Y in the dataset suggests there were mapping/sequencing errors. The reads mapped to Chr Y cannot be explained by sequence homology to chromosome X, because only unique reads were mapped to the reference genome. For these reasons, Chr Y serves as a perfect internal negative control. As shown in Table 1, 12.7 thousand of the 24.3 million ChIP reads were mapped to Chr Y. Given that the length of Chr Y is 54.7 M and the whole genome is about 3 billion bps, at least 0.9 M (3.64%) reads in the ChIP sample are predicted to be wrongly mapped or sequenced.
STAT1 Targets Identified by ChIP-PaM in the ChIP-Seq Dataset
With a pre-specified eFDR level of 0.5 for the count data, regions containing six or fewer unique tag reads were eliminated, leaving 69,809 PBRs. For each PBR, a p-value based on the count at the regional peak was calculated from the fitted G-P model, and a dissimilarity score based on shape pattern matching was computed. These two values were used to select 190 regions with low count p-values and the best-matched shapes, which showed strong evidence of TF bindings. Short genomic sequences around the peak sites (± 20 bp) in the 190 region were retrieved for de novo motif finding by using MEME. Because each input sequence was very short (40 bp), we specified that each sequence contained either one motif or none. A motif was identified in 112 out of 190 regions (58.9%), and all of the PBRs were then scanned for this motif by using the MAST program . The de novo motif found strongly matched the STAT1 GAS motif previously identified and validated in biological experiments . From the MAST scan, a p-value for the best motif match was obtained for each PBR. Therefore, three values were associated with every PBR: a p-value based on count distribution (pc), a dissimilarity score based on shape pattern matching (dp), and a p-value based on motif matching (pm). The three scores were log-transformed and the log-dp and log-pm were scaled to the level of log-pc by regression. The PBRs were then re-ranked by averaging the three scores. Because the original eFDR was specified to be 0.5, this suggests the true positive rate is also 0.5 in all PBRs. Therefore, the top 34905 regions are considered significant target regions.
The pre-specified eFDR level is somewhat arbitrary. A general rule in choosing the eFDR is that it should be large enough to incorporate sufficient number of PBRs for re-ranking, yet not too large to include too many noisy regions. When we used another α of 0.7 to analyze the data, 117,479 PBRs were identified and 35243 (117479*(1-0.7)) were selected as significant regions. The number of significant regions resulting from the two eFDRs is almost identical, as although a higher eFDR (α) yields a larger pool of PBRs, a smaller rate of true discovery rate (1-α) offsets the initial large number in the final result. We found that an eFDR of 0.5 is a good choice because it can generate sufficient PBRs for further improvement while being computationally more efficient than higher eFDRs.
Comparison with Other Algorithms
We compared ChIP-PaM with SISSRs, PeakSeq and ChIP-PaM using tag counts information only. In the STAT1 ChIP-Seq dataset described above, PeakSeq identified 36,998 significant regions  and SISSRs identified 85,892 significant regions. To make the further comparison fair, we used the same number of top 36,998 regions for all algorithms.
Twenty-two genomic promoters have been experimentally validated to be regulated and bound by STAT1 protein upon IFN-γ stimulation . We used this information to compare the power of the algorithms. As shown in Figure 4B, PeakSeq, ChIP-PaM and ChIP-PaM using count only have almost identical "cumulative power"; they all detected a maximum of 14 of 24 positive promoters. However, the SISSRs had the least power to detect the known sites. In the rest 8 targets that were not identified by either method, a detailed look at their genomic regions found that essentially no reads were mapped in this ChIP-Seq sample, and therefore no algorithm can detect them. This suggests that ChIP-PaM is efficient in identifying the true STAT1 targets.
With the advance of the next-generation techniques, ChIP-Seq experiments are expected to be in great demand for the important biological studies of transcription regulatory network. Therefore, more efficient models and algorithms to analyze such data are urgently needed. Here we have proposed a new method of analysis of ChIP-Seq data that is based on ChIP-Seq sample only and that retains and even improves the accuracy and statistical power of binding site discovery.
There are four potential major mechanisms by which a region with enriched tags might be observed in ChIP-Seq data: (1) "true positive" TF target regions; (2) focal amplification of certain genomic regions; (3) nonspecific binding of TFs to the genome; and (4) random noise from experimental procedures. The difficulty arises in effectively separating tag enrichment resulting from (1) from those resulting from the others. Our algorithm is designed to improve the accuracy of finding in each instance. The Gamma-Poisson modeling takes account of the non-uniform noise background across the genome and helps to model both nonspecific TF binding and random noises. The use of pattern recognition to match a TF binding pattern will improve the detection of the true enrichment pattern. For example, enrichment induced by focal amplification shows a shape pattern (similar to Figure 6A) very different from that of enrichment by TF binding, and the pattern matching step of our algorithm can efficiently remove it from the final results. Furthermore, the de novo motif finding and searching step will help eliminate non-specific binding regions that do not contain the conserved motif sequences.
Because our proposed method incorporates three lines of evidence to determine the significant TF target regions, it is more robust than other current methods and detects fewer false positive bindings. However, it is often difficult to determine the statistical accuracy of the findings when multiple lines of evidence are integrated. In ChIP-PaM, we propose a novel data-based two-step FDR procedure to solve this problem. In this procedure, an eFDR (α) derived from the genome-wide tag count distribution is pre-specified to select the potential TF binding regions, and then the shape and motif information is incorporated to re-rank the selected PBRs. As the α level controls the overall FDR and re-ranking of the PBRs will not change it, the top (1- α) re-ordered candidate regions are considered to be significant. This data incorporation procedure can be potentially applied to other integrated analyses as well. Another advantage of our algorithm is that unlike other methods, in which the average length of the ChIP fragments must be estimated, ChIP-PaM makes use of the maximum length of the fragments, which is known from the experimental procedures. This should reduce the variation of the findings.
The entire analysis of the STAT1 example dataset took about 1.5 hours on a regular desktop computer. Therefore, our algorithm is computationally efficient. The majority of time is spent on the shape matching and motif identification steps. Only 5 minutes is needed to scan the PBRs and compute preliminary result from them. If the inference is to be made on the basis of the count data and only the genome scanning part is required, our algorithm would be extremely fast and might be algorithmically attractive for other applications, such as epigenetic analysis of modified histone proteins . Further investigation is needed in this direction.
Finally, one point worth noting is that although our algorithm requires no control sample, control samples may have an important role, depending on the scientific questions asked. For example, if a study were to compare the binding site difference of a TF in its "active" v.s. "inactive" form, a ChIP-Seq sample for the "inactive" TF would be a perfect control. In this case, two ChIP-PaM analyses would be performed independently on the two ChIP-Seq samples.
We propose a new algorithm, ChIP-PaM, for genome-wide identification of the transcription factor target genes by using the ChIP-Seq data. Unlike other methods of analyzing ChIP-Seq data, ChIP-PaM incorporates three lines of evidences, including tag count modeling at the peak position, pattern matching of a specific tag count distribution, and de novo motif finding and searching along the genome. A comparison with existing methods showed that our method can greatly improve the accuracy of binding site discovery while maintaining comparable statistical power.
Because information from sequencing procedures is used as parameter inputs of our algorithm, we will briefly describe the sequencing process. In ChIP-Seq sample preparation, genomic DNAs are either sonicated or digested into random fragments and size-selected (100-800 bp) for better sequencing accuracy. The ChIP-DNA fragments submitted for sequencing are therefore within a certain range of length, e.g. 150-250 bp. Owing to cost and technical reasons, typical tag reads acquired from sequencing apparatus for ChIP-Seq experiments are small (~30-60 bp), and consequently reflect only the two ends of the fragments, not the whole fragments. The end positions of fragments can be revealed by mapping the short-read tags back to a reference genome. For single-end sequencing, because the sequencing adaptors are ligated onto two ends of a fragment randomly, approximately equal numbers of tags are expected to be obtained in forward and reverse directions within a region.
If the tag reads from ChIP-Seq experiments are uniformly distributed on the genome that is divided into equal windows of d bp, basic probabilistic considerations imply that the distribution of unique tag counts in a certain window should obey the Poisson distribution. However, since the binding affinity of a TF to the whole genome is not uniform (i.e., specific and non-specific bindings), the ideal Poisson model will not be followed. It is expected that the whole count data will be a mixture of Poisson distributions. Assuming that the majority of DNA fragments are background signals, and that the significantly enriched regions reside largely in the right tail of the distribution, a zero-truncated Gamma-Poisson density can be employed as the background model for the count data. The Gamma-Poisson model is shown as below:
Zeros are not included in the model, as zero counts can occur simply because of non-mappability of certain regions and therefore does not truly reflect the zero counts in the model. The G-P model is employed to infer whether a specific region has a significantly high count. The mean count r(1-p)/p is positively correlated with the size of the window, i.e., the maximum fragment size d, and the sequencing depth of total number of reads obtained, and r(1-p)/p2 measures the dispersion of the data. There are several ways to estimate the model parameters . Here we employ the maximum likelihood method. The maximum-likelihood estimate of the parameters (r, p) was done by using the nonlinear optimization algorithms implemented in R .
Pattern Matching with Wavelet Smoothing
Simulation of a Reference Pattern
A TF ChIP-Seq experiment is simulated to generate the reference pattern used for the pattern matching. Suppose there are 3 × 106 DNA fragments with random length between 150 and 250 bp that are uniformly distributed on a 1 kb genomic region. Assume there is one TFBS located in the center of the region. Assume that the probability of being pulled down for sequencing is 0.1% for DNA fragments containing the TFBS, and 0.005% for fragments not containing the TFBS and that the two ends of one fragment have an equal probability of being sequenced. The tag count distributions for the forward and reverse strands along the genomic region are generated and used to build the reference pattern.
Motif Finding and Searching
MEME is one of the most widely used tools for de novo consensus motif finding http://meme.sdsc.edu/meme4_1/. Because MEME searches for motifs by performing Expectation-Maximization (EM) on a motif model, it is very time-consuming for large input datasets. We chose an input size of fewer than 500 sequences for motif finding, to minimize the drawbacks of too few sequences (not representative) or of a number too large to be computationally feasible. Only motifs contained in at least 25% of sequences submitted were considered to be canonical motifs and were used for subsequent motif searching. Also, as each submitted sequence for motif finding is approximately 40 bps, we expect that such short sequences will not contain more than one motif. Therefore, the maximum number of motifs in each sequence was set at 1 for MEME. Position-specific scoring matrices (PSSMs) generated by MEME were used as input for MAST, a sister program of MEME developed for motif alignment and searching. MAST compares the sequence from PESRs with the motifs identified from MEME and calculates match scores for each sequence. A p-value is generated to score the probability of each sequence that may contain the motif found.
We thank Ms. Sharon Naron in Scientific Editing at St Jude Children's Research Hospital for her professional editing support. This work was supported in part by the American Lebanese Syrian Associated Charities (ALSAC).
- Latchman DS: Eukaryotic Transcription Factors. 2008, Elsevier Ltd, fifthGoogle Scholar
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, Snyder M, Jones S: Gnome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4: 651-7. 10.1038/nmeth1068.View ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-8. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008, 24: 1729-30. 10.1093/bioinformatics/btn305.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9: R137-10.1186/gb-2008-9-9-r137.PubMed CentralView ArticlePubMedGoogle Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36: 5221-31. 10.1093/nar/gkn488.PubMed CentralView ArticlePubMedGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-Seq data. Nat Biotechnol. 2008, 26: 1293-300. 10.1038/nbt.1505.PubMed CentralView ArticlePubMedGoogle Scholar
- Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008, 5: 829-34. 10.1038/nmeth.1246.PubMed CentralView ArticlePubMedGoogle Scholar
- Nix DA, Courdy SJ, Boucher KM: Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks. BMC Bioinformatics. 2008, 9: 523-10.1186/1471-2105-9-523.PubMed CentralView ArticlePubMedGoogle Scholar
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-Seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 6: 1351-9. 10.1038/nbt.1508.View ArticleGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-Seq experiments relative to controls. Nature Biotechnol. 2009, 27: 66-75. 10.1038/nbt.1518.View ArticleGoogle Scholar
- Spyrou C, Stark R, Lynch AG, Tavaré S: BayesPeak: Bayesian analysis of ChIP-Seq data. BMC Bioinformatics. 2009, 10: 299-10.1186/1471-2105-10-299.PubMed CentralView ArticlePubMedGoogle Scholar
- Tuteja G, White P, Schug J, Kaestner KH: Extracting transcription factor targets from ChIP-Seq data. Nucleic Acids Res. 2009, 37: e113-10.1093/nar/gkp536.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133: 1106-17. 10.1016/j.cell.2008.04.043.View ArticlePubMedGoogle Scholar
- Elf J, Li GW, Xie XS: Probing transcription factor dynamics at the single-molecule level in a living cell. Science. 2007, 316: 1191-4. 10.1126/science.1141967.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. 1994, AAAI Press, 28-36.Google Scholar
- Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998, 14: 48-54. 10.1093/bioinformatics/14.1.48.View ArticlePubMedGoogle Scholar
- Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research. 2006, 34: W369-W373. 10.1093/nar/gkl198.PubMed CentralView ArticlePubMedGoogle Scholar
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.View ArticleGoogle Scholar
- Miller FJ, Rosenfeldt FL, Zhang C, Linnane AW, Nagley P: Precise determination of mitochondrial DNA copy number in human skeletal and cardiac muscle by a PCR-based assay: lack of change of copy number with age. Nucleic Acids Research. 2003, 31: e61-10.1093/nar/gng060.PubMed CentralView ArticlePubMedGoogle Scholar
- Baker M: Epigenome: mapping in motion. Nature Methods. 2010, 7: 181-186. 10.1038/nmeth0310-181.View ArticleGoogle Scholar
- McCullagh P, Nelder JA: Generalized Linear Models, second ed. 1989, Chapman & Hall, LondonView ArticleGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2009,http://www.R-project.orghttp://www.R-project.orgGoogle Scholar
- Percival DB, Walden AT: Wavelet Methods for Time Series Analysis. 2000, Cambridge University PressView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.