Abstract
Given a matrix X composed of symbols, a bicluster is a submatrix of X obtained by removing some of the rows and some of the columns of X in such a way that each row of what is left reads the same string. In this paper, we are concerned with the problem of finding the bicluster with the largest area in a large matrix X. The problem is first proved to be NP-complete. We present a fast and efficient randomized algorithm that discovers the largest bicluster by random projections. A detailed probabilistic analysis of the algorithm and an asymptotic study of the statistical significance of the solutions are given. We report results of extensive simulations on synthetic data.
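To make the definition concrete, the following minimal sketch checks whether a chosen set of rows and columns forms a bicluster and finds a largest-area bicluster by exhaustive search on a tiny matrix. It is not the randomized projection algorithm of the paper: the function names `is_bicluster` and `largest_bicluster_bruteforce` and the example matrix are hypothetical, and the search is exponential in the number of columns, which is consistent with the problem being NP-complete.

```python
from itertools import combinations

def is_bicluster(X, rows, cols):
    """Return True if every selected row reads the same string on the selected columns."""
    if not rows or not cols:
        return False
    first = [X[rows[0]][c] for c in cols]
    return all([X[r][c] for c in cols] == first for r in rows)

def largest_bicluster_bruteforce(X):
    """Exhaustive search over column subsets (illustration only, tiny inputs).

    For a fixed column subset, the best row set is the largest group of rows
    that agree on those columns, so only column subsets are enumerated.
    Returns (area, rows, cols).
    """
    n_rows, n_cols = len(X), len(X[0])
    best = (0, (), ())
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            # Group rows by the string they read on the chosen columns.
            groups = {}
            for r in range(n_rows):
                key = tuple(X[r][c] for c in cols)
                groups.setdefault(key, []).append(r)
            rows = max(groups.values(), key=len)
            area = len(rows) * k
            if area > best[0]:
                best = (area, tuple(rows), cols)
    return best

# Hypothetical 4 x 4 matrix of symbols, one string per row.
X = ["abca", "abcb", "cbca", "abca"]
area, rows, cols = largest_bicluster_bruteforce(X)
print(area, rows, cols)
```

Here the area of a bicluster is taken to be the number of selected rows times the number of selected columns. In the example, rows 0, 1 and 3 all read "abc" on columns 0, 1 and 2, giving area 9, which beats both the full-width bicluster on rows 0 and 3 (area 8) and the two-column bicluster covering all four rows (area 8).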
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lonardi, S., Szpankowski, W., Yang, Q. (2004). Finding Biclusters by Random Projections. In: Sahinalp, S.C., Muthukrishnan, S., Dogrusoz, U. (eds) Combinatorial Pattern Matching. CPM 2004. Lecture Notes in Computer Science, vol 3109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27801-6_8
DOI: https://doi.org/10.1007/978-3-540-27801-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22341-2
Online ISBN: 978-3-540-27801-6
eBook Packages: Springer Book Archive