Abstract
Multiple imputation (MI) is a popular method for dealing with missing values. However, the appropriate way to apply clustering after MI remains unclear: how should partitions be pooled, and how should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. Partition pooling is addressed using consensus clustering, while bootstrap theory is used to explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are theoretically justified and extensively studied by simulation. Pooling partitions improves accuracy, while measuring instability with missing data broadens the possibilities for data analysis: it allows the dependence of the clustering on the imputation model to be assessed, and it provides a convenient way to choose the number of clusters when data are incomplete, as illustrated on a real data set.
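To make the idea of pooling partitions concrete, the following is a minimal sketch of one simple consensus heuristic: among the partitions obtained on the M imputed data sets, keep the one that disagrees least, in total, with all the others. This "best of M" rule is a stand-in chosen for brevity and is not claimed to be the pooling rule developed in the paper. The function names (`pair_disagreements`, `pool_partitions`, `mean_disagreement`) and the toy labels are hypothetical, and the mean disagreement is only a crude indication of how much the partitions vary across imputations, not the bootstrap-based instability measure.

```python
from itertools import combinations


def pair_disagreements(p, q):
    """Count pairs of observations grouped together in one partition but not in the other."""
    if len(p) != len(q):
        raise ValueError("partitions must label the same observations")
    return sum(
        (p[i] == p[j]) != (q[i] == q[j])
        for i, j in combinations(range(len(p)), 2)
    )


def pool_partitions(partitions):
    """Return the index of the candidate partition with the smallest total disagreement."""
    totals = [
        sum(pair_disagreements(p, q) for q in partitions)
        for p in partitions
    ]
    return min(range(len(partitions)), key=totals.__getitem__)


def mean_disagreement(partitions):
    """Average proportion of discordant pairs over all pairs of partitions."""
    n = len(partitions[0])
    n_pairs = n * (n - 1) // 2
    dists = [
        pair_disagreements(p, q) / n_pairs
        for p, q in combinations(partitions, 2)
    ]
    return sum(dists) / len(dists)


# Hypothetical cluster labels for 6 observations, one partition per imputed data set (M = 3).
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as the first, labels swapped
    [0, 0, 1, 1, 1, 1],  # third observation assigned to the other cluster
]

best = pool_partitions(partitions)
pooled = partitions[best]
spread = mean_disagreement(partitions)
```

Because the comparison is made on pairs of observations rather than on raw labels, the first two partitions are treated as identical even though their cluster labels are swapped, which is the property any pooling rule for partitions needs.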
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Audigier, V., Niang, N. Clustering with missing data: which equivalent for Rubin’s rules? Adv Data Anal Classif 17, 623–657 (2023). https://doi.org/10.1007/s11634-022-00519-1
DOI: https://doi.org/10.1007/s11634-022-00519-1