
Peer Reviewed Article

Vol. 3 (2016)

Biclustering of Omics Data using Rectified Factor Networks

Published
2016-02-28

Abstract

Biclustering has become a prominent technique for analyzing large datasets represented as a samples-by-features matrix, and it has been applied successfully in the biological sciences for drug design and in e-commerce for recommender systems. One of the most successful biclustering methods, Factor Analysis for Bicluster Acquisition (FABIA), is a generative model in which each bicluster is represented by two sparse membership vectors: one for the samples and one for the features. Because computing the posterior is computationally expensive, FABIA is limited to approximately 20 code units. In addition, its code units are not always sufficiently decorrelated, which makes sample membership difficult to determine. To overcome these limitations, we propose using Rectified Factor Networks (RFNs), a recently introduced unsupervised deep learning algorithm. RFNs use their posterior means to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN learning is a generalized alternating minimization algorithm that is based on the posterior regularization method and enforces non-negative, normalized posterior means. Each code unit represents a bicluster, consisting of the samples for which the code unit is active and the features for which the code unit has activating weights. RFN outperformed 13 other biclustering algorithms, including FABIA, on 400 benchmark datasets and on three gene expression datasets with known clusters. On data from the 1000 Genomes Project, RFN detected DNA sequences that indicate interbreeding with other hominins that began before the ancestors of modern humans left Africa.
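
The mapping from code units to biclusters described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a model has already been trained and that its non-negative posterior means and its weights are available as plain nested lists. The function name `extract_biclusters`, the argument layout, and the `weight_threshold` parameter are hypothetical, and "activating weight" is interpreted here as a weight strictly greater than that threshold.

```python
from typing import List, Set, Tuple

Bicluster = Tuple[Set[int], Set[int]]  # (sample indices, feature indices)


def extract_biclusters(
    hidden: List[List[float]],
    weights: List[List[float]],
    weight_threshold: float = 0.0,
) -> List[Bicluster]:
    """Read one bicluster off each code unit.

    hidden[k][i]  : posterior mean of code unit k for sample i (assumed >= 0)
    weights[j][k] : weight connecting code unit k to feature j
    weight_threshold is a hypothetical cutoff, not a value from the paper.
    """
    biclusters: List[Bicluster] = []
    for k, unit_activity in enumerate(hidden):
        # Samples belong to the bicluster when the code unit is active for them.
        samples = {i for i, h in enumerate(unit_activity) if h > 0.0}
        # Features belong when the code unit has an activating weight to them
        # (assumption: "activating" means weight > weight_threshold).
        features = {j for j, row in enumerate(weights) if row[k] > weight_threshold}
        # Skip code units that are never active or have no activating weights.
        if samples and features:
            biclusters.append((samples, features))
    return biclusters


# Toy data: 2 code units, 4 samples, 3 features (values are made up).
hidden = [
    [0.9, 0.0, 0.4, 0.0],  # unit 0 is active for samples 0 and 2
    [0.0, 0.7, 0.0, 0.0],  # unit 1 is active for sample 1
]
weights = [
    [1.2, 0.0],  # feature 0
    [0.8, 0.0],  # feature 1
    [0.0, 0.5],  # feature 2
]
result = extract_biclusters(hidden, weights)
```

Because the posterior means are sparse and non-negative, sample membership reduces to checking which entries are greater than zero. How strictly to threshold the weights is a design choice this sketch leaves open.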
