 Review
 Open Access
UGM: a more stable procedure for large-scale multiple testing problems, new solutions to identify oncogene
Theoretical Biology and Medical Modelling volume 16, Article number: 20 (2019)
Abstract
Variations in gene expression levels play an important role in tumors. Numerous methods exist to identify differentially expressed genes in high-throughput sequencing data, and several algorithms endeavor to identify distinctive genetic patterns associated with susceptibility to particular diseases. Although these procedures have proved successful, the number of non-differentially expressed genes estimated within false discovery rate (FDR) procedures has a large standard deviation, and the misidentification rate (type I error) grows rapidly as the number of genes to be tested becomes larger. In this study we developed a new method, Unit Gamma Measurement (UGM), which accounts for the distribution of the test statistics across multiple hypotheses and could reduce the dependency problem. Simulated expression profile data and breast cancer RNA-Seq data were used to test the accuracy of UGM. The results show that the number of non-differentially expressed genes estimated by UGM is very close to the real value, and that UGM also has a smaller standard error, range, quartile range and RMS error. In addition, UGM can be used to screen many breast cancer-associated genes, such as BRCA1, BRCA2, PTEN and BRIP1. UGM thus provides better accuracy, robustness and efficiency for identifying differentially expressed genes in high-throughput sequencing data.
Introduction
Cancer is a major public health problem worldwide. It is a disease that arises from uncontrolled cell cycle progression, proliferation and intercellular communication. To date, more than 100 types of cancer have been diagnosed in humans [1]. Scientists have reached a consensus that cancer is caused by both genetic factors, such as mutations and disrupted hormones, and environmental factors [2]. Some tumors are hereditary diseases, which are attributed to disorders of the mechanisms regulating cell growth and proliferation. In general, genetic or epigenetic changes in DNA can confer malignant potential on a normal cell [3, 4]. Cellular oncogenes, anti-oncogenes and DNA repair genes are the major types of genes that contribute to this process. The interaction of these genes is sometimes referred to as the “driver” of cancer [5].
Although the genomic composition of an individual's cells is almost identical, genetic, transcriptional and expression variation may occur during cell differentiation and proliferation. Investigating the differences in gene expression profiles among cells in different states provides significant insight into the function of genes and their products [6]. Identifying the connection between a disease and a genetic or expression pattern is therefore of great significance. Differential molecules can be screened in two ways: from protein expression data, or from RNA-Seq data to detect differentially expressed genes. Over the past decade, many genome-wide studies have demonstrated that many genes harbor over-represented mutations, such as tumour protein 53 (TP53) [7], phosphatase and tensin homolog deleted on chromosome ten (PTEN) [8], Kirsten rat sarcoma viral oncogene homolog (KRAS) [9], myelocytomatosis viral oncogene (MYC) [10] and breast cancer (BRCA) [11].
Gene chips are also known as bioarrays or microarrays, and the technology is based on the hybridization theory of Edwin Mellor Southern. A gene chip prototype was proposed in the 1980s, and the first gene chip was produced in 1991. With the development of the Human Genome Project and molecular biology technology, gene chip technology has developed rapidly over the past 20 years. Gene chips can detect information related to tumor growth and have evolved into a sophisticated technology for tumor detection and analysis. The rapid development of gene chip technology has had a revolutionary impact on medical research [12].
Genomics research shows that differences in gene expression are associated with biological conditions and disease stages. In recent decades, microarray technology has been a useful tool for the quantitative analysis of gene expression. Both microarray data and RNA-Seq data are characterized by small sample sizes and high-dimensional variables. Therefore, identifying differentially expressed genes in these data requires multiple comparisons. When many hypotheses are tested simultaneously, the false discovery rate (FDR) is a widely adopted criterion for controlling type I errors. The FDR is the expected proportion of false discoveries among the rejected hypotheses [13, 14]. With respect to type I error, FDR-controlling procedures are not as strict as family-wise error rate (FWER) controlling procedures, which control the probability of making at least one type I error [15]. FDR-controlling procedures therefore have greater power, but at the cost of a higher type I error rate [16, 17]. At the same time, the results of different methods differ considerably, and so far there is no consensus in the scientific community on the most efficient, robust and accurate method. This paper therefore proposes a new method for screening differentially expressed genes from gene expression profiling data, and uses simulated gene chip data and breast cancer data to verify its validity and accuracy. It also aims to provide a case study for the clinical screening of differentially expressed genes.
Methods
Multiple hypothesis testing and FDR
In the 1950s, multiple hypothesis testing began to gain traction, and the problem of multiple comparisons has become particularly prominent in high-throughput data analysis. Microarray data are an example of high-dimensional data, characterized by a small sample size and a large number of variables, and constitute a typical multiple hypothesis testing problem. Table 1 summarizes this situation in the traditional form.
The FDR is defined as the expectation of the proportion of false discoveries, V/R, where V is the number of false rejections and R is the total number of rejections. At present, the FDR is widely used in practical problems. According to the literature, when m_{0} = m, FDR = FWER; when m_{0} ≤ m, FDR ≤ FWER. FDR control not only improves the power of the test but also improves on the traditional multiple hypothesis testing procedures, which are too conservative. The FDR therefore supplies an applicable error criterion for multiple testing of large-scale data. Commonly used FDR control procedures include Benjamini & Liu (BL), Benjamini & Hochberg (BH), Benjamini & Yekutieli (BY) and the adaptive linear step-up (ALSU) procedure. Currently the most widely used is the ALSU procedure, which runs as follows:
(1) Let H_{01}, H_{02}, H_{03}, …, H_{0m} be the tested null hypotheses. Test each hypothesis with a single-test method to obtain the P values P_{1}, P_{2}, P_{3}, …, P_{m}, and sort them in ascending order as \( {P}_1^{\ast },{P}_2^{\ast },{P}_3^{\ast },\dots, {P}_m^{\ast } \).
(2) Let \( r\left(\lambda \right)=\underset{1\le i\le m}{\max}\left\{i:{P}_i^{\ast}\le \lambda \right\} \), where λ is usually taken as 0.5; r(λ) is the number of P values not exceeding λ.
(3) Estimate \( {\hat{\pi}}_0 \) by \( {\hat{\pi}}_0=\frac{m-r\left(\lambda \right)}{m\left(1-\lambda \right)} \) and \( {\hat{m}}_0 \) by \( {\hat{m}}_0=\frac{m-r\left(\lambda \right)}{1-\lambda } \), where \( {\hat{m}}_0 \) estimates the number of true null hypotheses.
(4) Compute \( \hat{k}=\underset{1\le i\le m}{\max}\left\{i:{P}_i^{\ast}\le \frac{i}{m}\alpha \right\} \), where α = 0.05.
(5) If \( \hat{k} \) exists, reject the hypotheses \( {H}_{0(1)}^{\ast },{H}_{0(2)}^{\ast },{H}_{0(3)}^{\ast },\dots, {H}_{0\left(\hat{k}\right)}^{\ast } \). Otherwise, do not reject any hypothesis.
(6) Adjust \( {P}_i^{\ast } \) by \( {P}_i^{\ast }=\underset{i\le k\le m}{\min}\left\{\min \left\{\frac{\hat{m_0}}{k}\ast {P}_k^{\ast },1\right\}\right\} \).
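The six steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `alsu` and its return layout are hypothetical, the rejection threshold follows step (4) exactly as written (with m in the denominator), and ties between P values are not treated specially.

```python
def alsu(p_values, alpha=0.05, lam=0.5):
    """Illustrative sketch of steps (1)-(6); names and return layout are assumptions."""
    m = len(p_values)
    # Step 1: sort the P values in ascending order, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    p_sorted = [p_values[i] for i in order]
    # Step 2: r(lambda) is the number of P values not exceeding lambda.
    r = sum(1 for p in p_sorted if p <= lam)
    # Step 3: estimate the number of true null hypotheses.
    m0_hat = (m - r) / (1.0 - lam)
    # Step 4: largest rank i with P*_(i) <= (i / m) * alpha; 0 means "does not exist".
    k_hat = 0
    for i, p in enumerate(p_sorted, start=1):
        if p <= i / m * alpha:
            k_hat = i
    # Step 5: reject the hypotheses with the k_hat smallest P values.
    rejected = sorted(order[:k_hat])
    # Step 6: adjusted P values as a running minimum from the largest rank down.
    adjusted_sorted = [0.0] * m
    running = 1.0
    for k in range(m, 0, -1):
        running = min(running, min(m0_hat / k * p_sorted[k - 1], 1.0))
        adjusted_sorted[k - 1] = running
    adjusted = [0.0] * m
    for rank, idx in enumerate(order):
        adjusted[idx] = adjusted_sorted[rank]
    return m0_hat, k_hat, rejected, adjusted


# Toy input (made-up P values, for illustration only).
p = [0.001, 0.004, 0.012, 0.03, 0.2, 0.45, 0.6, 0.7, 0.8, 0.95]
m0_hat, k_hat, rejected, adjusted = alsu(p)
print(m0_hat, k_hat, rejected)  # 8.0 3 [0, 1, 2]
```

In the toy input, six of the ten P values are at most λ = 0.5, so the estimate is (10 − 6)/(1 − 0.5) = 8.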
The key step of the ALSU procedure is therefore the estimation of m_{0}. The accuracy of this estimate is crucial for screening differentially expressed genes, for FDR control and for the power of the test. However, statisticians have found this approach to be very unstable [18]. Even though the mean of the m_{0} estimates over many repetitions of the FDR procedure is very close to the true value, the standard deviation (SD) is very large, which causes wide random deviation in any single estimate. It is therefore necessary to improve the estimation algorithm for m_{0}.
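The instability described above can be illustrated with a small simulation. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: null P values are drawn as independent Uniform(0, 1) variables, the P values of differentially expressed genes are drawn from Uniform(0, 0.01), and the function names `m0_lambda` and `simulate_m0` are hypothetical.

```python
import random
import statistics


def m0_lambda(p_values, lam=0.5):
    """Lambda-based estimate of m0: (m - r(lambda)) / (1 - lambda)."""
    m = len(p_values)
    r = sum(1 for p in p_values if p <= lam)
    return (m - r) / (1.0 - lam)


def simulate_m0(m=1000, pi0=0.9, reps=200, seed=1):
    """Repeat the estimate on fresh synthetic P values; return its mean and SD."""
    rng = random.Random(seed)
    m0 = int(round(m * pi0))
    estimates = []
    for _ in range(reps):
        null_p = [rng.random() for _ in range(m0)]               # assumed uniform nulls
        alt_p = [rng.uniform(0.0, 0.01) for _ in range(m - m0)]  # assumed strong signals
        estimates.append(m0_lambda(null_p + alt_p))
    return statistics.mean(estimates), statistics.stdev(estimates)


mean_m0, sd_m0 = simulate_m0()
print(f"mean = {mean_m0:.1f}, SD = {sd_m0:.1f}")
```

Under these assumptions the estimate equals twice the number of P values above 0.5, so its mean is 900 while its SD is about 2·√(900 · 0.25) = 30: unbiased on average, yet widely scattered from run to run.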
New estimation method
The P value is the probability, when the null hypothesis is true, of obtaining a result at least as extreme as the one observed in the sample. In hypothesis testing, the P value is used to decide the outcome of the test and reflects the strength of the evidence for accepting or rejecting the null hypothesis: the smaller the P value, the more significant the test result. Suppose the null hypothesis is H_{0}, the alternative hypothesis is H, and the sample observations are X_{1}, X_{2}, X_{3}, …, X_{n}. After selecting an appropriate statistic T, we can compute the corresponding P value. For multiple hypothesis tests, the distribution of the P values is shown in Fig. 1.
Fig. 1 shows that, in the ideal state, the P values follow a very regular pattern. If the number of genes is m and the proportion of non-differentially expressed genes is π_{0}, then the number of non-differentially expressed genes is m_{0} = m ∗ π_{0}. Assume there is a value γ such that the P values of all differentially expressed genes are distributed in (0, γ). In this case, the genes distributed in (γ, 1) should all be non-differentially expressed. In this region, the number of non-differentially expressed genes per unit gamma length is \( \#\left\{{P}_i\ge \gamma \right\}\ast \frac{\gamma }{1-\gamma } \). The number of genes distributed in (0, γ) should therefore theoretically be the sum of all the differentially expressed genes and this quantity, i.e., \( m-{m}_0+\frac{\gamma }{1-\gamma}\#\left\{{P}_i\ge \gamma \right\} \). To reduce the effect of random error, we calculated the number of non-differentially expressed genes over multiple gamma intervals.
The key of this algorithm is to estimate m_{0}. Let H_{01}, H_{02}, H_{03}, …, H_{0m} be the null hypotheses (one per gene), let P_{1}, P_{2}, P_{3}, …, P_{m} be the P values of the corresponding independent hypothesis tests, and let α be the level of significance. Because the algorithm is built on the number of genes per unit gamma length, we name it Unit Gamma Measurement (UGM). It proceeds as follows:
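As a worked step, the single-γ argument above can be solved for m_{0}. The derivation below is a sketch that assumes, as the text does, that every differentially expressed gene has a P value below γ; the shorthand N_γ for the number of P values at or above γ is introduced here for illustration only.

```latex
% N_\gamma := \#\{P_i \ge \gamma\}, so the number of P values in (0,\gamma) is m - N_\gamma.
\begin{aligned}
m - N_\gamma &= (m - m_0) + \frac{\gamma}{1-\gamma}\, N_\gamma \\
m_0 &= N_\gamma + \frac{\gamma}{1-\gamma}\, N_\gamma \\
m_0 &= \frac{N_\gamma}{1-\gamma}
\end{aligned}
```

With a single cutoff the estimate therefore has the same form as the λ-based estimate \( \left(m-r\left(\lambda \right)\right)/\left(1-\lambda \right) \) with λ = γ, which is consistent with the motivation for combining several gamma intervals rather than relying on one.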
(1) Let H_{01}, H_{02}, H_{03}, …, H_{0m} be the tested null hypotheses. Test each hypothesis with a single-test method to obtain the P values P_{1}, P_{2}, P_{3}, …, P_{m}, and sort them in ascending order as \( {P}_1^{\ast },{P}_2^{\ast },{P}_3^{\ast },\dots, {P}_m^{\ast } \).
(2) Select an appropriate cutoff gamma, which is used to partition the P values. Gamma should be greater than the level of significance and can be increased appropriately when there are many genes. Count the genes whose P values fall in (0, γ), (γ, 2γ), …, (n ∗ γ, (n + 1) ∗ γ), where (n + 2) ∗ γ is greater than 1. We define Pre _ γ and Lat _ γ(k) as follows:

(3) Estimate m − m_{0}. The estimation method is as follows:
τ_{i} is a weight coefficient, whose formula is as follows:
(4) Get \( {\hat{m}}_0 \)

(5) Adjust \( {P}_i^{\ast } \) by \( {P}_i^{\ast }=\underset{i\le k\le m}{\min}\left\{\min \left\{\frac{\hat{m_0}}{k}\ast {P}_k^{\ast },1\right\}\right\} \).
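The counting in step (2) can be sketched as follows. This is a hedged illustration only: the definitions of Pre _ γ, Lat _ γ(k) and the weights τ_{i} are not reproduced here, so the second function uses a plain unweighted mean of the later intervals as a stand-in. Both function names are hypothetical, and the stand-in is not the UGM estimator itself.

```python
def gamma_bin_counts(p_values, gamma):
    """Count P values in each full interval [k*gamma, (k+1)*gamma) below 1."""
    n_bins = int(1.0 / gamma)
    if n_bins < 2:
        raise ValueError("gamma must leave at least two full intervals in (0, 1)")
    counts = [0] * n_bins
    for p in p_values:
        k = int(p / gamma)
        if k < n_bins:  # P values in the leftover partial interval are ignored
            counts[k] += 1
    return counts


def unweighted_m0(p_values, gamma):
    """Stand-in estimate (an assumption): mean count of the intervals after the
    first, taken as the null count per gamma length, scaled to the unit interval."""
    counts = gamma_bin_counts(p_values, gamma)
    later = counts[1:]
    per_gamma = sum(later) / len(later)
    return per_gamma / gamma


# Toy input (made-up P values): three small P values, nine spread over (0.1, 1).
p = [0.01, 0.02, 0.03, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
print(gamma_bin_counts(p, 0.1))  # [3, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Averaging over several intervals, instead of using a single cutoff, is what damps the random error of any one count; the weighting of the intervals is where the actual method differs from this stand-in.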
Simulation experiment and evaluation parameters
We generated gene expression profiles in silico according to the data structure presented in Table 2. The sample size of the experimental group (patient group) and of the control group (normal observation group) is 40. The population means of the gene expression levels in the experimental and control groups are μ_{1i} and μ_{2i}, respectively. When the gene index i is less than m_{0} (non-differentially expressed genes), μ_{1i} = μ_{2i} = μ. When the gene index is greater than m_{0} − 1 (differentially expressed genes), μ_{1i} ≠ μ_{2i}. To avoid the impact of chance on the results, we repeated the experiment 1000 times for each value of π_{0}.
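A data set of this shape can be generated along the following lines. The sketch makes several assumptions that are not stated in the paper: expression values are normal with unit variance, differentially expressed genes are shifted by a fixed amount `delta`, and a two-sided z-test with known variance replaces the t-test so that only the standard library is needed. The function name `simulate_p_values` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_p_values(m=500, pi0=0.9, n=40, mu=0.0, delta=1.0, seed=0):
    """Return one two-sided P value per gene; the first m0 genes are null."""
    rng = random.Random(seed)
    m0 = int(round(m * pi0))
    std_norm = NormalDist()
    se = math.sqrt(2.0 / n)  # standard error of the mean difference when sigma = 1
    p_values = []
    for i in range(m):
        mu1 = mu
        mu2 = mu if i < m0 else mu + delta  # genes from index m0 onward are shifted
        x1 = [rng.gauss(mu1, 1.0) for _ in range(n)]
        x2 = [rng.gauss(mu2, 1.0) for _ in range(n)]
        z = (mean(x1) - mean(x2)) / se
        p_values.append(2.0 * (1.0 - std_norm.cdf(abs(z))))
    return p_values, m0


p_values, m0 = simulate_p_values()
print(len(p_values), m0)  # 500 450
```

Repeating the call with different seeds and feeding each P value list to an m_{0} estimator gives the repeated-experiment design described above.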
Results
Performance on simulated data
In general, the proportion of differentially expressed genes is small, i.e., π_{0} is large. In the simulation, we set the total number of genes (m) to 10,000, 8000, 5000, 3000, 2000 and 1000, and the value of π_{0} to 0.8, 0.85, 0.9 and 0.95. In each case, we estimated m_{0} using the Adaptive Benjamini and Hochberg (ABH), Storey & Tibshirani λ (S-λ), Two Stages Test (TST) and UGM methods, and computed the average of the m_{0} estimates over 1000 repeated simulations.
Table 3 shows the mean of m_{0} estimated by ABH, S-λ, TST and UGM under the different conditions. We used the estimated and actual m_{0} values for a relative error analysis. The relative error of the UGM method ranges from −0.181% to 0.156%. The relative errors of the other three methods (ABH, S-λ and TST) range from 0.071% to 5.900%, from −0.708% to 0.431% and from −4.873% to −4.633%, respectively. The m_{0} estimates of the four methods follow the same trend as the actual value. However, the results of the UGM method differ significantly from those of the ABH and TST methods (P = 0.01296 and P = 0.0000, chi-square test), whereas no significant difference was detected between the UGM and S-λ methods (P = 0.8644).
The SD represents the dispersion of the data. The range is the difference between the maximum and minimum values. The quartile range is the distance between the upper and lower quartiles. Both the range and the quartile range reflect the fluctuation and dispersion of the data. The root mean squared error (RMSE) measures the disparity between the estimated values and the true value. The coefficient of variation (CV), the ratio of the SD to the mean, expresses dispersion on a unit-free scale. Table 4 compares the m_{0} estimates of the four methods using these six indicators.
Table 4 shows that the estimates of all four methods tend towards 2850. However, the TST method deviates substantially in its estimate of the number of non-differentially expressed genes (2714.4), i.e., the TST algorithm is less reliable for m_{0} estimation. The means show that the m_{0} estimated by the UGM method is the closest to the real value, slightly better than that of the S-λ algorithm. In addition, the quartile ranges computed using the ABH, UGM and S-λ methods increase in that order, although the results of the ABH and UGM methods are very close to each other. Moreover, the SD, range and CV obtained with the UGM method are better than those of both the ABH and S-λ methods, which means that the estimates produced by the proposed method are less dispersed. In summary, UGM is more stable, accurate and robust than the other conventional algorithms.
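The six indicators can be computed from a list of repeated m_{0} estimates as in the sketch below. The function name `summarize` and the toy estimates are illustrative assumptions; the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the quartile convention used for Table 4.

```python
import math
import statistics


def summarize(estimates, true_value):
    """Mean, SD, range, quartile range, RMSE and CV of repeated estimates."""
    avg = statistics.mean(estimates)
    sd = statistics.stdev(estimates)                  # sample standard deviation
    q1, _, q3 = statistics.quantiles(estimates, n=4)  # lower, middle, upper quartiles
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))
    return {
        "mean": avg,
        "sd": sd,
        "range": max(estimates) - min(estimates),
        "quartile_range": q3 - q1,
        "rmse": rmse,
        "cv": sd / avg,
    }


# Made-up estimates around a true value of 2850, for illustration only.
stats = summarize([2840, 2850, 2860, 2870, 2830], true_value=2850)
print(round(stats["sd"], 3), stats["range"], round(stats["rmse"], 3))  # 15.811 40 14.142
```

The RMSE differs from the SD in that it measures spread around the true value rather than around the sample mean, so a biased estimator is penalized even when its estimates are tightly clustered.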
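As a quick check of how the relative errors relate to the means, the TST figure quoted above can be converted by hand. The sketch assumes that 2850 is the true m_{0} for this setting (it matches m = 3000 with π_{0} = 0.95 among the simulated configurations); the function name is hypothetical.

```python
def relative_error_percent(estimate, true_value):
    """Signed relative error of an estimate, in percent."""
    return 100.0 * (estimate - true_value) / true_value


# Mean TST estimate quoted in the text against the assumed true value of 2850.
print(round(relative_error_percent(2714.4, 2850), 3))  # -4.758
```

The result of about −4.76% falls inside the −4.873% to −4.633% interval reported for the third of the comparison methods.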
Performance on real data
To verify the validity and accuracy of UGM, we further tested it on breast cancer gene chip data. The real data set was not specially selected; it is breast cancer gene chip data used in our previous research. The gene chip data were downloaded from the NCBI GEO database (platform number: GPL570; accession number: GSE31192) [5, 19]. Total RNAs were extracted from breast cancer and normal tissues. The experimental group was women with breast cancer, and the control group was women of the same age without breast cancer. Malignant epithelia and tumor-associated matrix of pregnancy-associated breast cancer (PABC) and non-PABC were isolated by laser capture microdissection and subjected to gene expression profiling. Eventually, a total of 33 sets of gene expression data, composed of 20 tumor tissues and 13 normal tissues profiled by 22,283 probes, were obtained.
The breast cancer gene chip data were preprocessed with the RMA procedure, and the P values of all probes were computed with the t-test or Satterthwaite’s approximate t-test. With the FDR set at 0.05, ALSU and UGM were used to estimate m_{0} and to identify the differentially expressed genes associated with breast cancer. The results are shown in Table 5.
The UGM and ALSU algorithms yielded 4397 (8.04%) and 4282 (7.83%) differentially expressed genes, respectively, whereas the general t-test yielded 11,319 (20.7%). Relative to the general t-test, UGM and ALSU reduced the number of genes declared differentially expressed by 6922 (61.2%) and 7037 (62.2%), respectively. The ALSU and UGM methods are significantly more powerful than the general t-test (p = 0). Moreover, the number of differentially expressed genes identified by the UGM method was slightly higher than that of ALSU, suggesting that the UGM method provides a more comprehensive screening result with higher efficiency and a reduced false negative rate.
Risk factors for developing breast cancer include being female, obesity, lack of physical exercise, drinking alcohol and ionizing radiation. In recent years, many cancers have been recognized as inherited diseases in which a subset of genes is mutated, including BRCA1 and BRCA2, both of which are tumor suppressors. The proteins they encode help repair damaged DNA and therefore play a role in ensuring the stability of the cell’s genetic material. Specific inherited mutations in BRCA1 and BRCA2 increase the risk of female breast and ovarian cancers, and they have been associated with increased risks of several additional types of cancer. In this paper, we used the UGM algorithm to analyze the gene expression profile data of breast cancer. The results showed that BRCA1 (P = 0.007) and BRCA2 (P = 0.000129) were selected as cancer susceptibility genes (differentially expressed genes). Moreover, many genes related to BRCA1 and BRCA2 were also screened out: BRIP1 (P = 0.0000572), PTEN (P = 0.00399), RAD51 (P = 0.00389), BARD1 (P = 0.0344), MMP11 (P = 0.0256), RRM2 (P = 0.000823), NEK2 (P = 0.0000149), MKI67 (P = 0.000397), ITGA7 (P = 0.0195) and CXCL5 (P = 0.0014).
In this paper, the data we used were breast cancer gene expression profile data. We further used the DAVID Bioinformatics Resources 6.8 (https://david.ncifcrf.gov) to analyze the gene-disease associations of the differentially expressed genes. DAVID 6.8 allows researchers to associate sets of genes from a gene list (here, the list of differentially expressed genes) with disease phenotypes, employing information from OMIM and the Genetic Association Database mapped to DAVID genes. The results showed that there were 2 terms associated with breast cancer, and 224 (8.414%) genes were enriched in disease terms associated with breast cancer (p1 = 8.31E-05, p2 = 1.57E-04). The results of the gene-disease association analysis of the differentially expressed genes are shown in Fig. 2.
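For reference, the statistic and the Satterthwaite degrees of freedom behind the approximate t-test mentioned above can be computed as in this sketch. It is an illustration with made-up numbers, not the authors' code: the function name `welch_t` is hypothetical, and the P value itself is left out because the Python standard library has no Student t distribution.

```python
import math
import statistics


def welch_t(x, y):
    """Unequal-variance t statistic and Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = v1 / n1 + v2 / n2                                   # squared standard error
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Made-up expression values for one probe in two small groups.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 3))  # -1.897 5.882
```

With 20 tumor and 13 normal samples, the same formula would be applied once per probe, and the resulting P values would then enter the multiple testing procedure.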
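The reductions quoted above follow directly from the three counts, as the short check below shows (variable names are illustrative).

```python
t_test_count = 11319                     # genes flagged by the general t-test
adjusted = {"UGM": 4397, "ALSU": 4282}   # genes flagged after FDR control

for name, count in adjusted.items():
    reduction = t_test_count - count
    percent = round(100.0 * reduction / t_test_count, 1)
    print(name, reduction, percent)
# UGM 6922 61.2
# ALSU 7037 62.2
```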
Conclusion and discussion
In this paper, we have improved the use of P values from multiple hypothesis testing in identifying disease-associated genes. The methods were compared on simulated microarray data using the mean, SD, range, quartile range, RMSE and CV as evaluation indices. The simulation results showed that the mean number of non-differentially expressed genes (m_{0}) estimated by the new method was very close to the real value. The results of the UGM method differ significantly from those of the ABH and TST methods (P = 0.01296 and P = 0.0000). However, there was no significant difference between the UGM method and the S-λ method (P = 0.8644). These results suggested that the UGM method and the S-λ method are significantly superior to the ABH and TST methods. In addition, the SD, range, quartile range, CV and RMSE of the number of non-differentially expressed genes calculated by the S-λ method were all larger than those of the UGM method, i.e., its estimates are more dispersed, which is concordant with the study by Wu Jing [16]. In summary, UGM exhibited better stability, accuracy and robustness than the other conventional algorithms.
To verify the effectiveness of the proposed method in screening differentially expressed genes, we applied it to breast cancer gene expression profile data. The results showed that the UGM method was significantly more powerful than the general t-test (p = 0) and yielded a slightly larger set of differentially expressed genes than ALSU, indicating a lower false negative rate and higher screening efficiency. Among the differentially expressed genes screened by the UGM method, a number of well-established oncogenes and anti-oncogenes were found, including BRCA1, BRCA2, PTEN, BRIP1 [20], RAD51 [21], BARD1 [16, 17], MMP11 [22], RRM2 [23] and NEK2 [24]. Furthermore, genes associated with BRCA1, BRCA2 and TP53 were also identified, such as ITGA7 [25] and CXCL5 [26].
Microarray technology and DNA and RNA sequencing technology have produced huge amounts of gene data, which have been widely used in biomedical research. Gene expression profile data are high-dimensional, while the sample size is small. Identifying informative candidate genes from expression profile data has become an imperative task and attracts extensive attention in biology and medical statistics research. Microarrays can provide a dynamic snapshot of cell activity, but the results are not self-evident. The purpose of this paper is to provide useful answers to some of the most common practical problems in microarray data analysis, especially the multiple testing of differential expression.
In microarray data analysis, one of the critical problems of multiple testing is to estimate the number of true null hypotheses. Traditional procedures control the FWER, which is the probability of making at least one type I error. When the number of genes is large, the ability to detect differentially expressed genes decreases, and bona fide differentially expressed genes may be missed. In actual research, identifying differentially expressed genes from expression profile data is important for gene localization, identification of biomarkers and therapeutic targets, and the study of disease mechanisms. In multiple comparisons, the expected proportion of null hypotheses that are wrongly rejected is a more meaningful indicator than the probability of making any error. Against this background, Benjamini and Hochberg [14] developed the FDR control procedure, which was a groundbreaking achievement. The traditional methods need to control the FWER tightly, which gives a conservative type I error rate under any configuration of the tested hypotheses. The FDR method keeps the false discovery rate within an allowable range, which provides an appropriate metric for multiple testing of large-scale data. Following Benjamini and Hochberg's (BH) pioneering paper, the concept of FDR has been widely used in large-scale data analysis. Many scholars have extended the BH method and developed many excellent methods. The adaptive linear step-up (ALSU) method proposed by Benjamini et al. has been widely used in previous studies.
The key step in the ALSU procedure is to estimate the number of non-differentially expressed genes. However, we find that the estimation method used in this procedure is not accurate enough. Although the average of the estimated values over many repetitions is very close to the true value, the standard deviation remains large. This introduces a large amount of random error, leading directly to inaccurate final results. In this study, we designed a new method to estimate the number of non-differentially expressed genes and demonstrated its superiority using well-established microarray data.
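The contrast between FWER control and FDR control drawn above can be made concrete with a toy comparison of the Bonferroni correction (a standard FWER-controlling rule, used here as an illustrative stand-in for the "traditional" procedures) and the BH step-up rule. The function names and the P values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """FWER control: reject when P <= alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def bh_reject(p_values, alpha=0.05):
    """FDR control (BH step-up): reject the k smallest P values, where k is the
    largest rank with P_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])


p = [0.001, 0.008, 0.012, 0.03, 0.2, 0.6]
print(bonferroni_reject(p))  # [0, 1]
print(bh_reject(p))          # [0, 1, 2, 3]
```

On the same six P values the FWER rule rejects two hypotheses and the FDR rule rejects four, which illustrates the gain in detection ability that motivates FDR-based screening when many genes are tested.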
Availability of data and materials
The gene chip data are available at https://www.ncbi.nlm.nih.gov/. The genedisease association analysis is available at https://david.ncifcrf.gov. All data and materials are fully available without restriction.
References
1. Datta K, Choudhuri M, Guha S, Biswas J. Breast cancer scenario in a regional cancer centre in eastern India over eight years - still a major public health problem. Asian Pac J Cancer Prev. 2012;13:809–13. https://doi.org/10.7314/apjcp.2012.13.3.809.
2. Stojadinovic A, Summers TA, Eberhardt J, Cerussi A, Grundfest W, Peterson CM, et al. Consensus recommendations for advancing breast cancer: risk identification and screening in ethnically diverse younger women. J Cancer. 2011;2:210–27. https://doi.org/10.7150/jca.2.210.
3. Schmidt LS, Linehan WM. Genetic predisposition to kidney cancer. Semin Oncol. 2016;43:566–74. https://doi.org/10.1053/j.seminoncol.2016.09.001.
4. Salehi M, Kamali E, Karahmadi M, Mousavi SM. RORA and autism in Isfahan population: is there an epigenetic relationship. Cell J. 2017;18:540–6. https://doi.org/10.22074/cellj.2016.4720.
5. Li FY, Zhou J, Xu M, Yuan G. Exploration of a multi-target ligand, dehydroevodiamine, for the recognition of three G-quadruplexes in c-Myb proto-oncogene by ESI-MS. Int J Mass Spectrom. 2017a;414:39–44. https://doi.org/10.1016/j.ijms.2017.01.006.
6. Heikkila JJ. The expression and function of hsp30-like small heat shock protein genes in amphibians, birds, fish, and reptiles. Comp Biochem Physiol A Mol Integr Physiol. 2017;203:179–92. https://doi.org/10.1016/j.cbpa.2016.09.011.
7. Sato K, Hara T, Ohya M. The code structure of the p53 DNA-binding domain and the prognosis of breast cancer patients. Bioinformatics. 2013;29:2822–5. https://doi.org/10.1093/bioinformatics/btt497.
8. Jia PL, Zheng SY, Long JR, Zheng W, Zhao ZM. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics. 2011;27:95–102. https://doi.org/10.1093/bioinformatics/btq615.
9. Ambroise J, Piette AS, Delcorps C, Rigouts L, De Jong BC, Irenge L, et al. AdvISER-PYRO: amplicon identification using SparsE representation of PYROsequencing signal. Bioinformatics. 2013;29:1963–9. https://doi.org/10.1093/bioinformatics/btt339.
10. Panopoulos AD, Smith EN, Arias AD, Shepard PJ, Hishida Y, Modesto V, et al. Aberrant DNA methylation in human iPSCs associates with MYC-binding motifs in a clone-specific manner independent of genetics. Cell Stem Cell. 2017;20:505. https://doi.org/10.1016/j.stem.2017.03.010.
11. Farmer H, McCabe N, Lord CJ, Tutt ANJ, Johnson DA, Richardson TB, et al. Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy. Nature. 2005;434:917–21. https://doi.org/10.1038/nature03445.
12. Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009;10:669–80. https://doi.org/10.1038/nrg2641.
13. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, et al. NCBI GEO: mining millions of expression profiles - database and tools. Nucleic Acids Res. 2005;33:D562–6. https://doi.org/10.1093/nar/gki022.
14. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.
15. Benjamini Y, Krieger AM, Yekutieli D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika. 2006;93:491–507. https://doi.org/10.1093/biomet/93.3.491.
16. Wu J, Liu CY, Chen WT, Ma WY, Ding Y. A new method for estimating the number of non-differentially expressed genes. Genet Mol Res. 2016a;15:13–28. https://doi.org/10.4238/gmr.15017402.
17. Wu WW, Nishikawa H, Fukuda T, Vittal V, Asano M, Miyoshi Y, et al. Interaction of BARD1 and HP1 is required for BRCA1 retention at sites of DNA damage. Cancer Res. 2016b;75:1311–21. https://doi.org/10.1158/0008-5472.CAN-14-2796.
18. Burbelo PD, Ambatipudi K, Alevizos I. Genome-wide association studies in Sjögren’s syndrome: what do the genes tell us about disease pathogenesis? Autoimmun Rev. 2014;13:756–61. https://doi.org/10.1016/j.autrev.2014.02.002.
19. Li WX, He K, Tang L, Dai SX, Li GH, Lv WW, et al. Comprehensive tissue-specific gene set enrichment analysis and transcription factor analysis of breast cancer by integrating 14 gene expression datasets. Oncotarget. 2017b;8:6775–86. https://doi.org/10.18632/oncotarget.14286.
20. Daino K, Imaoka T, Morioka T, Tani S, Iizuka D, Nishimura M, et al. Loss of the BRCA1-interacting helicase BRIP1 results in abnormal mammary acinar morphogenesis. PLoS One. 2013;8:e74013. https://doi.org/10.1371/journal.pone.0074013.
21. Marsden CG, Jensen RB, Zagelbaum J, Rothenberg E, Morrical SW, Wallace SS, et al. The tumor-associated variant RAD51 G151D induces a hyper-recombination phenotype. PLoS Genet. 2016;12:e1006208. https://doi.org/10.1371/journal.pgen.1006208.
22. Wan XC, Pu HL, Huang WH, Yang S, Zhang YL, Kong Z, et al. Androgen-induced miR-135a acts as a tumor suppressor through downregulating RBAK and MMP11, and mediates resistance to androgen deprivation therapy. Oncotarget. 2016;7:51284–300. https://doi.org/10.18632/oncotarget.9992.
23. Rasmussen RD, Gajjar MK, Tuckova L, Jensen KE, Maya-Mendoza A, Holst CB, et al. BRCA1-regulated RRM2 expression protects glioblastoma cells from endogenous replication stress and promotes tumorigenicity. Nat Commun. 2018;9:5396. https://doi.org/10.1038/s41467-018-07892-6.
24. Lee J, Gollahon L. Mitotic perturbations induced by Nek2 overexpression require interaction with TRF1 in breast cancer cells. Cell Cycle. 2013;12:3599–614. https://doi.org/10.4161/cc.26589.
25. Nunes AM, Wuebbles RD, Sarathy A, Fontelonga TM, Deries M, Burkin DJ, et al. Impaired fetal muscle development and JAK-STAT activation mark disease onset and progression in a mouse model for merosin-deficient congenital muscular dystrophy. Hum Mol Genet. 2017;26:2018–33. https://doi.org/10.1093/hmg/ddx083.
26. Zhao JK, Ou BC, Han DP, Wang PX, Zong YP, Zhu CC, et al. Tumor-derived CXCL5 promotes human colorectal cancer metastasis through activation of the ERK/Elk-1/Snail and AKT/GSK3β/β-catenin pathways. Mol Cancer. 2017;16(1):70. https://doi.org/10.1186/s12943-017-0629-4.
Acknowledgements
The authors thank Professor Ding Yong for help in data analysis. The authors thank Dr. Wu Jing for suggestions and corrections that improved the text.
Funding
This work has been supported by the Innovation Foundation of Nanjing Medical University (2014NJMU035) and Nanjing Medical Science and Technology Development Fund “Youth Project Talent Training Special Funds” (QRX11033).
Author information
Contributions
Chengyou Liu wrote the article. Chengyou Liu, Junlin Zhu and Hongbing Jiang designed the study and guided the experiment. Leilei Zhou, Yuhe Wang and Shuchang Tian collected the data. Hang Qin provided funding support. Yong Ding provided technical support. All authors were responsible for the experimental design and proofread the final version of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Liu, C., Zhou, L., Wang, Y. et al. UGM: a more stable procedure for large-scale multiple testing problems, new solutions to identify oncogene. Theor Biol Med Model 16, 20 (2019). https://doi.org/10.1186/s12976-019-0117-1
DOI: https://doi.org/10.1186/s12976-019-0117-1
Keywords
 Differentially expressed genes
 False discovery rate
 Standard deviation
 RNA-Seq data
 Root mean square error
 Cancer-associated genes