 Research
 Open Access
 Published:
Method for generating multiple risky barcodes of complex diseases using ant colony algorithm
Theoretical Biology and Medical Modellingvolume 14, Article number: 4 (2017)
Abstract
Background
Susceptible barcode recognition plays an important role in the diagnosis and treatment of complex diseases. Numerous approaches have been proposed to identify risky barcodes involved in the progress of complex diseases. However, some methods only consider differences in barcode frequencies between the control and disease groups; as such, these methods may be partial or even wrong. For example, some barcodes with a high risk ratio yield a low frequency on cases or exhibit a high frequency on controls, which may unreasonable from a statistical point.
Results
In our study, a stricter criteria, maximum discrepancy and maximum constituency, is designed to evaluate each barcode and ant colony algorithm is used to search combination space of epistasis. For complex diseases with multisubtypes, our method can list several potential barcodes contributing to different subtypes of complex diseases. Another contribution of this work is to introduce a method for determining the length of barcodes and excluding noisy barcodes whose frequencies are abnormal. In addition, common pathogenic genes shared by different risky barcodes are also recognized, which may provide key clue for further study, such as gene function analysis.
Conclusions
Experimental results reveal that our method can find multiple risky barcodes whose risk ratio and odds ratio are >1. These barcodes could be related to different subtypes of complex diseases.
Background
Single nucleotide polymorphisms (SNPs) play an important role in the diversity of phenotypes [1]. Genomewide association studies (GWAS) have been applied to reveal the effect of SNPs on complex diseases and to identify candidate susceptible genes, which are statistically correlated with specific complex diseases. After few candidate SNPs have been recognized through GWAS, risky SNP barcodes regarded as a genotype in multiple SNPs should be identified [2]. For case–control study with haplotype samples, the barcode is the same as haplotype in given SNPs. However, the barcode is also suitable for association study on genotype datasets. If an individual carries a diseasespecific barcode, the individual can be regarded as a potential patient with a relatively high possibility. Thus, further analysis and personalized medicines can be successfully launched.
The computational complexity of analyzing SNP interactions has been confirmed as a NP hard problem [3, 4]. As such, a global optimal solution is difficult to be located in a combination space, especially when the number of SNPs is large. Numerous methods were proposed to balance the quality of solutions and calculation costs, and many approaches tried to find near optimal solutions in such a huge combination space. For example, intelligent algorithms can be suitable; Chang et al. [2] used a particle swarm optimization algorithm to search for barcodes with a high risk. Yang et al. [3] improved a genetic algorithm by using a selective strategy to optimize the maximum difference. A heuristic search strategy can also be applied to analyze SNP interactions. For instance, Chuang et al. [4] designed a branchandbound algorithm to narrow down the combination space of SNP interactions.
Although these approaches have some merits, some issues should be addressed in further. For instance, some approaches only focus on the difference in barcodes frequencies between cases and controls. In our study, the consistency of barcodes is simultaneously considered to identify the risky barcode related to these cases. In another instance, some methods intend to search for a specific barcode, which is globally optimal [5]. However, these cases can be precisely divided into several subtypes on the basis of the trait heterogeneity of a complex disease [6]. It means that there may be several different risky barcodes related to different subtypes of complex diseases. In our study, an improved ant colony algorithm is proposed to address these problems and to generate multiple solutions; in each iteration, optimum solutions are saved as candidate barcodes attributed to a specific subtype of complex diseases. More importantly, traditional methods make the preference that odds ratio (OR) or risk ratio (RR) produces plausible barcodes, which yield a high OR but exhibit a low frequency in the case group or a high frequency in the control group.
Our study aims to address all these three issues in the process of generating risky barcodes. To search feasible solutions in the huge combination space of SNP interactions, an ant colony optimization (ACO) is applied to identify multiple risky barcodes with criteria maximum discrepancy and maximum consistency (MDMC). Note that we provide a new way to determine the length of risky barcodes according to the frequency of barcode. This is because from statistical point these barcodes should meet a reasonable frequency in cases or controls.
Methods
In Eq. (1), a matrix M contains m samples, and each sample includes n SNPs and a label which indicates the status of sample (normal or effected). Each x _{ i,j } entry denotes genotype of jth SNP on the ith sample; in this study, x _{ i,j } could be 0 (major allele homozygous), 1 (heterozygotes) or 2 (minor allele homozygous). For a case–control study, l could be 0 (control) or 1 (case).
In this work, a MDMC measure is designed to determine a risky barcode which is highly associated with labels of samples. ACO algorithm is applied to generate these risky barcodes. Thus, our method can be denoted by MDMCACO.
MDMC
Let X be the random variable of barcode and X takes the values of the barcode x _{ i } = {x _{ i1}, … x _{ ij }, … x _{ in }}, where xij is the genotype on ith barcode at the SNP j, i ∈ {1, …, m}, j ∈ {1, …, n}. Then, let pcase(x i) be the frequency of x i on cases and let pcontrol(x i) be the frequency of x i on controls.
Maximum discrepancy(MD): Intuitively, a barcode, which causes a complex disease, should appear frequently in cases but rarely in controls. The discrepancy of frequencies should also be as large as possible. MD is described in Eq. (2).
When D is negative, xi is a weak solution. A more risky barcode should yield a larger value. SNPs selected according to MD could be redundant or noisy. For example, the discrepancy could be constant when a new SNP is added to a barcode. Therefore, the following maximum consistency condition can be used to restrain noisy SNPs.
(Maximum consistency)MC: If no other factors result in these cases except genes, these cases should be attributed to several risky barcodes. At the same time, these barcodes should never appear in controls. Although other unknown factors undermine this ideal situation, MC with cases should be maintained to exclude noise.
Equation (3) describes relative entropy, which is a fundamental measure in information theory [7, 8]. If p(x i) remains unchanged when a new SNP is selected, ∏ _{ j = 1} ^{n} pj(xij) decreases. Consequently, C decreases. It means that an unreasonable growth of the number of causing SNPs is restrained.
ACO algorithm
As the number of SNPs increases, the combination space of epistasis grows rapidly, which results in that a global optimal solution cannot be searched in such a huge space. Therefore, ACO algorithm is proposed to heuristically search for nearoptimal solutions.
The ACO algorithm, which is a classic swarm intelligence method, solves computational tasks by using a probabilistic technique [9]. ACO was initially proposed to solve an optimal path for the traveling salesman problem. ACO has been widely applied to various research areas. ACO simulates the behaviors of natural ants as they search for foods; some of these behaviors include laying down pheromone and so on. In the following subsections, the design of ACO is detailed.
Population encoding and initialization
In accordance with the risky barcode searching problem, each individual I of ACO, namely, barcode, is designed in a format on the basis of the selected SNPs and genotypes on these SNPs. The encoding scheme is expressed as follows:
where each component of I contains the ID of SNP and the genotype of the selected SNP. k is the number of SNPs in a barcode.
In the traditional ACO, ants can be randomly placed on SNPs. In this study, each artificial ant is distributed in a fixed distance because of the linkage disequilibrium (LD) between adjacent SNPs. For example, suppose that there are w ants and n SNPs. Then, all of these SNPs in a chromosome are divided into w segments. During the ACO initialization, the ith ant can only be placed in the ith segment, but the ant can randomly select a SNP in each segment. This strategy can improve global optimality.
Objective function
The processes of SNP selection and barcode generation are determined by an objective function. In this study, each solution generated by ACO will be evaluated by MDMC, where MD and MC are combined in Eq. (4).
Obviously, Eq. (4) is the simplest combination formula of MD and MC to maximize MDMC. Intuitively, a risky barcode which is highly correlated to the progress of complex diseases will yields a higher F.
SNP selection and pheromone update
A key communication medium of ACO that directs an ant’s decisionmaking ability is pheromone. If a SNP is selected by many ants, the pheromone on the SNP is rich. Therefore, the possibility of other ants selecting it to formulate a risky barcode is high. The ant decision of SNP selection is defined as follows:
R is the set of unresolved SNPs, and α and β are two important weights of pheromone trail and heuristic value, respectively. A heuristic value η, which can accelerate convergence, is essential for analysis of largescale dataset. When previous studies have found some risky SNPs which are pathogenic to a specific complex disease through clinical validation, we can set a high η for these risky SNPs to ensure that they will be selected with relatively high probability. However, we suppose that risky barcodes are identified from few candidate SNPs in this study. Heuristic information is unnecessary; thus, β is 0.
Once all the ants have searched for their solutions in each iteration, laying down pheromone and evaporation in all of the SNPs are triggered. In accordance with the objective function, the pheromone update of τ is defined as Eq. (6).
where ρ is the pheromone evaporation factor and ρ ∈ (0, 1). Δτi(t) is expressed as Eq. (7).
where T^{k}(t) is the set of SNP selected at the tth iteration. Eq. (7) denotes that the kth ant deposits a quantity of pheromone Δτi(t) on the SNPs involved in a potential barcode.
Determining the length of barcode
A pathogenic barcode could not be too frequent in controls and also could not be too infrequent in cases from a statistical view. According to this property, some noisy barcodes can be excluded, and oversize barcodes will be inhibited. Therefore, we introduce a noisy pruning strategy to determine the length of barcode in ACO. The length of a risky barcode is bound to a reasonable region neither too long nor too short.
During the process of barcode generation, we find that if a short barcode is infrequent, its superset is also infrequent. It means that if an ant generates a solution which is too infrequent in cases, there is no need to continue. Consequently, instead of expanding the size of barcodes, the ant then returns a short barcodes.
Pseudocode of MDMCACO
The pseudocode is listed below to describe MDMCACO and the outline of searching for potential risky barcodes is depicted in Fig. 1.
The first loop ends when a specific amount of iterations is reached or a fixed number of risky barcodes is satisfied.
Datasets and performance measurement
Dataset of breast cancer
The predictive barcodes related to a specific complex disease is identified through the following steps: (1) use GWAS to identify risky SNPs by a case–control study; (2) run a permutation test for these SNPs; (3) deeply sequence the mutation around candidate SNPs; (4) carry out a candidate association study to determine SNPs in epistasis; and (5) search for pathogenic barcodes.
In this study, risky barcodes of breast cancer are searched from a dataset which contains a candidate set of SNPs and 10,000 samples (5000 cases and 5000 controls). Twentythree candidate SNPs which are separately distributed in six genes, namely, COMT, CYP19A1, ESR1, PGR, SHBG, and STS are involved in the progress of breast cancer. We use the same number to represent the ID of each SNP, which is consistent with previous study [3]. These genes involved in steroid hormone metabolism and signaling are verified in [10–12]. Then, risky barcodes should also be analyzed to determine the harmful genotype of these genes in breast cancer.
Performance measurement
We apply two common measures, namely, OR and RR, to evaluate the performance of risky barcodes. These measures are widely used in epidemiological and case–control studies.
A predictive barcode can be used for early diagnosis or risk estimation. If a predictive barcode is considered as pathogenic, then an individual carrying the barcode would be diagnosed as positive. In Eqs. (8) and (9), TP denotes the proportion of cases carrying the predictive barcode, FP represents the proportion of controls holding the predictive barcode, and TN is the proportion of controls that do not carry the barcode, and FN is the proportion of cases that do not carry the barcodes.
Results
IGA is a method that applies an efficient strategy to improve genetic algorithms for generating potential barcode [3]. However, IGA still faces several issues. For instance, IGA does not consider trait heterogeneity. The only optimal solution of IGA ignores the multiple subtypes of complex diseases. Furthermore, IGA only uses the maximum difference as a fitness function, so that it cannot exclude noisy barcodes which exhibit high maximum difference with unreasonable frequency. In addition to these drawbacks, a long risky barcode could be generated, while a short barcode more likely occurs than a long barcode. In the following subsections, comprehensive results are described.
Comparison of MDMCACO and IGA on OR and RR
Because our method MDMCACO can generate multiple barcodes, we choose three different risky barcodes denoted as solution1, solution2, and solution3, respectively, and compare with those of IGA. The maximum length of these barcodes is 9 because our method restricts the frequency of barcodes. The comparison results of OR and RR are depicted in Figs. 2 and 3.
For solution1, these results of OR and RR are equal to IGA when the length of barcodes is shorter than 7. As the length increases, solution1 is generally better than IGA. By contrast, solution3 sustains its advantages as the length increases except when the length is equal to 6 and 9. The performance of solution3 is also different from IGA and solution1. For solution2, it is more risky than other solutions at the beginning. However, the pattern of solution2 is disrupted when other noisy SNPs are selected. Although the performance of solution2 declines, solution2 is also risky. It means that these barcodes are at different level of risk. These risky barcodea could lead to different subtypes of complex diseases, which indicates that our method can identify risky barcodes correlated with multiple subtypes. The details of these solutions of our method are discussed in the following sections.
Multiple risky barcode details
Multiple subtype patterns may hidden in cases because of trait heterogeneity. In this study, ACO is designed to obtain multiple risky barcodes correlated with subtypes of breast cancer. The details of these three barcodes are listed in Tables 1, 2 and 3. The maximum length of risky barcodes is set as 6 since a shorter barcode is statistically easier to form from statistical view.
When OR and RR of the barcodes are bigger than 1, the barcodes can be considered risky. When there are several different barcodes regarded as risk factors, the barcodes may represent different patterns correlated with different subtypes of breast cancer. Once the barcodes have been ranked according to the degree of risk, researchers can select the top barcodes for further clinical analysis.
From the results of 95% CI on OR and RR in Tables 1, 2 and 3, although these barcodes meet the criteria MD and MC and satisfy the requirement of OR and RR, not all of them meet a satisfied significant level. It is because that we have not set limitation directly on statistical significance during searching solutions. To further analyze statistical significance, we run our method three times and calculate the pvalue of several potential barcodes in these three solutions as shown in Table 4.
Barcodes listed in Table 4 are the most frequent solutions saved by MDMCACO in different run configurations. The barcodes located in the first six rows in Table 4 are statistically significant (pvalue < 0.05). In addition to statistical significance, we will give further analysis on pathogenic SNP and frequency of barcodes in next subsections.
The most common pathogenic SNP
Although the subtypes of complex diseases may exhibit different properties, these subtypes likely display a common feature to some extent. Therefore, some common SNPs may be involved in various risky barcodes. In this study, these potential barcodes obtained by MDMCACO are further analyzed. We find that among these barcodes, SNP 4 (rs3020314) on the ESR1 whcih is the most common SNP appears in all risk barcodes.
MDMCACO has generated numerous intermediate results during the process of optimization. We analyze these intermediate results and find that once a barcode carries rs3020314, the value of OR and RR would be increase. In previous studies [13–15], ESR1 encodes an estrogen receptor, a ligandactivated transcription factor composed of several domains essential for hormone binding, DNA binding, and transcription activation. ESR1 also participates in pathological processes, such as breast cancer, endometrial cancer, and osteoporosis. These biological functions of ESR1 verify the statistical results of our method.
Determination of the length of barcodes
The length of barcodes should be restricted due to two reasons : 1) For example, the OR and RR of solution1 increase as the length of barcodes grows. However, the optimum combination remains unknown. 2) If a short barcode is risky and pathogenic, then the short barcode more likely forms than a long barcode. However, a very short barcode may lead to information loss and false positive results may be obtained. Therefore, it is important to determine the length of potential barcodes.
To address these issues, we apply a further restriction on the frequency of these barcodes. Suppose that a complex disease only contains one kind of subtype and no other environmental factors influence the subtype. In a ideal case, patients likely carry one type of barcodes, and this barcode never exists in normal situations. However, other genetic or environmental factors can disrupt the ideal situation. Thus, the pathogenic pattern can be more complex in case–control studies: 1) multiple barcodes may be found in cases; 2) risky barcodes exist in controls.
Although the ideal situation is disturbed and complex situations possibly occur, it cannot be too extreme. In other words, risky barcodes cannot be too infrequent in cases and the barcode cannot also be too common in controls. In this study, a threshold of frequency is set as 1%. This finding indicates that the frequency of each risky barcode should satisfy the following condition: the frequency should be <1% in controls and the frequency should be >1% in cases. A total of 5000 cases and 5000 controls are considered in our study. Therefore, the number of patients carrying the barcode should be >50 and the number of normal people carrying the barcode should be <50. Two risky barcodes correlated to breast cancer survive because of this restriction, and their lengths are 6 and 3, respectively. One barcode is composed of 41117182122 (rs3020314rs9340799rs660149rs11571171rs272428rs858524) whose frequencies in cases and controls are 70 and 47, respectively. Another barcode is composed of 468(rs3020314rs1543404rs2747652) whose frequencies in cases and controls are 56 and 46, respectively.
Discussion
In this study, an intelligent algorithm is applied to optimize constraints MD and MC for identifying pathogenic barcodes. However, our approach cannot guarantee an optimal and stable solution for this kind of problem since ant colony algorithm is involved in probabilistic techniques. Despite this limitation, our method commits several merits as forementioned.
The results of our method are detailed in different aspects. IGA is compared with MDMCACO to validate the effectiveness of our method. These results reveal that our method can effectively determine pathogenic risky barcodes. Another merit of our method is that MDMCACO can generate multiple solutions which are related to different subtypes of complex diseases because of trait heterogeneity. Some of these barcodes are dominant since their OR and RR are significantly >1. A common ground is likely found among these different subtypes of complex diseases. It means that there are several common SNPs referred to different subtypes. Therefore, these common pathogenic SNPs need be further understood by gene function analysis.
In addition, considering the restriction of the barcode frequency, we can determine the optimum length of barcodes and exclude noisy barcodes. Multiple barcodes are regarded as risk factors of complex disease. Noise may exist in these solutions. Filter strategy can be applied. For example, these barcodes can be ranked on the basis of OR and RR. These top barcodes are dominant in cases.
The barcode rs3020314rs1543404rs2747652 whose OR and RR are 1.1 and 1.2 contains three SNPs and all of them exist in ESR1. Another barcode rs3020314rs9340799rs660149rs11571171rs272428rs858524 contains six SNPs located in ESR1, PGR, and SHBG, separately; the OR and RR of this barcode are 1.2 and 1.5, respectively. All of these genes are correlated with breast cancer [16–19].
Another challenging issue of epistasis analysis is computational complexity. In the pseudocode of MDMCACO, the complexity is Θ(I × A × L × m × n), where I is the number of the maximum iterations of ACO, and A is the amount of artificial ants, and L is the average length of barcodes, and m is the number of samples and n is the number of SNP.
Conclusions
In this study, MD and MC criteria are combined with ACO to identify risky barcodes related to breast cancer. MDMCACO has addressed several important issues. The results show that our method can be used for the epistasis analysis of complex diseases. Although MDMCACO can identify several risky barcodes, these barcodes should be further validated in clinical studies.
Although our method provides several advantages, further studies should be performed 1) to use other complex diseases to evaluate MDMC; 2) to design an algorithm appropriate for GWAS and candidate studies; and 3) to develop a software with friendly GUI.
Abbreviations
 ACO:

Ant colony optimization
 GWAS:

Genomewide association studies
 MDMC:

Maximum discrepancy and maximum consistency
 OR:

Odds ratio
 RR:

Risk ratio
 SNPs:

Single nucleotide polymorphisms
References
 1.
Chakravarti A, Clark AG, Mootha VK. Distilling pathophysiology from complex disease genetics. Cell. 2013;155(1):21–6.
 2.
Chang HW, Yang CH, Ho CH, et al. Generating SNP barcode to evaluate SNP–SNP interaction of disease by particle swarm optimization. Comput Biol Chem. 2009;33(1):114–9.
 3.
Yang CH, Lin YD, Chuang LY, et al. Evaluation of breast cancer susceptibility using improved genetic algorithms to generate genotype SNP barcodes. IEEE/ACM Trans Comput Biol Bioinform. 2013;10(2):361–71.
 4.
Yang J, Ferreira T, Morris AP, et al. Conditional and joint multipleSNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012;44(4):369–75.
 5.
Li X, Liao B, Chen H. A new technique for generating pathogenic barcodes in breast cancer susceptibility analysis. J Theor Biol. 2015;366:84–90.
 6.
Urbanowicz RJ, Andrew AS, Karagas MR, Moore JH. Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome: a learning classifier system approach. J Am Med Inform Assoc. 2013;20(4):603–12.
 7.
Liu Z, Lin S. Multilocus LD measure and tagging SNP selection with generalized mutual information. Genet Epidemiol. 2005;29(4):353–64.
 8.
Cover T, Thomas JA. Elements of information theory. New York: Wiley; 1991.
 9.
Marco D, Gambardella LM. Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans Evol Comput. 1997;1(1):53–64.
 10.
Giovannelli P, Di Donato M, Giraldi T, et al. Targeting rapid action of sexsteroid receptors in breast and prostate cancers. Front Biosci (Elite Ed). 2011;4:453–61.
 11.
Yu JC, Hsiung CN, Hsu HM, et al. Genetic variation in the genomewide predicted estrogen response elementrelated sequences is associated with breast cancer development [J]. Breast Cancer Research. 2011;13(1):R13.
 12.
Pharoah PD, Tyrer J, Dunning AM, Easton DF, Ponder BA. Association between common variation in 120 candidate genes and breast cancer risk. PLoS Genet. 2007;3(3):article e42.
 13.
Robinson DR, Wu YM, Vats P, et al. Activating ESR1 mutations in hormoneresistant metastatic breast cancer. Nat Genet. 2013;45(12):1446–51.
 14.
Toy W, Shen Y, Won H, et al. ESR1 ligandbinding domain mutations in hormoneresistant breast cancer. Nat Genet. 2013;45(12):1439–45.
 15.
Udler MS, Azzato EM, Healey CS, et al. Common germline polymorphisms in COMT, CYP19A1, ESR1, PGR, SULT1E1 and STS and survival after a diagnosis of breast cancer. Int J Cancer. 2009;125(11):2687–96.
 16.
Gabriel CA, Mitra N, Demichele A, et al. Association of progesterone receptor gene (PGR) variants and breast cancer risk in African American women. Breast Cancer Res Treat. 2013;139(3):833–43.
 17.
Fortunati N, Catalano MG. Sex hormonebinding globulin (SHBG) and estradiol crosstalk in breast cancer cells. Horm Metab Res. 2006;38(4):236–40.
 18.
Thompson DJ, Healey CS, Baynes C, et al. Identification of common variants in the SHBG gene affecting sex hormonebinding globulin levels and breast cancer risk in postmenopausal women. Cancer Epidemiol Biomarkers Prev. 2008;17(12):3490–8.
 19.
Sulkes A, Fuks Z, Gordon A, et al. Sex hormone binding globulin (SHBG) in breast cancer: a correlation with obesity but not with estrogen receptor status. Eur J Cancer Clin Oncol. 1984;20(1):19–23.
Acknowledgments
We are grateful to the anonymous reviewers whose suggestions and comments contributed to the significant improvement of this paper.
Funding
This paper is partially supported by the National Natural Science Foundation of China (Serial No.61602174 and 61563016), the Jiangxi Provincial natural science fund (No. 20161BAB212052 and 20151BAB217011) and the Scientific and Technological Research Project of Education Department in Jiangxi Province(GJJ150496). This work is also supported by Hunan Province high school Scientific Research Project (Grant No. 12C1093).
Availability of data and materials
Please contact author for data requests.
Authors’ contributions
XL carried out the design of the study and performed the statistical analysis. WJ participated in software coding and helped to draft the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Single nucleotide polymorphisms
 Complex diseases
 Ant colony algorithm
 Epistasis