
Clustering with missing data: which equivalent for Rubin’s rules?

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

Multiple imputation (MI) is a popular method for dealing with missing values. However, the appropriate way to apply clustering after MI remains unclear: how should partitions be pooled, and how should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. Partition pooling is addressed using consensus clustering, and bootstrap theory is used to explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are theoretically argued and extensively studied by simulation. Pooling partitions improves accuracy, while measuring instability with missing data broadens the possibilities for data analysis: it allows assessment of the dependence of the clustering on the imputation model, and provides a convenient way of choosing the number of clusters when data are incomplete, as illustrated on a real data set.
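For orientation, the classical Rubin's rules that the title refers to pool a scalar estimate over \({M}\) imputed data sets as follows. The notation \(\hat{Q}_m\) (estimate from the \(m\)-th imputed data set) and \(U_m\) (its estimated variance) is the standard one from the MI literature and is an assumption here, since this preview does not show the article's own notation. The rules the article develops for partitions are analogues of these and are not reproduced here; the appendix captions reuse the symbols \(\bar{U}\), B and T for within-, between- and total instability.

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m-\bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1+\frac{1}{M}\right)B
```

A partition has no natural mean or variance, which is why averaging as in \(\bar{Q}\) cannot be applied directly to clusterings.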

Author information

Corresponding author

Correspondence to Vincent Audigier.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Partitions pooling

See Tables 3 and 4.

Table 3 Accuracy of the clustering procedure under an MCAR mechanism: average adjusted Rand index (mean) and interquartile range (IQ) over the \({S}=200\) generated data sets, varying the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)), for various imputation methods (JM-DP or FCS-RF) and various consensus methods (NMF or SAOM)
Table 4 Accuracy of the clustering procedure under a MAR mechanism: average adjusted Rand index (mean) and interquartile range (IQ) over the \({S}=200\) generated data sets, varying the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)), for various imputation methods (JM-DP or FCS-RF) and various consensus methods (NMF or SAOM)
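The accuracy criterion named in these captions, the adjusted Rand index, can be computed from the contingency table of two partitions. The following is a minimal sketch using only the Python standard library; the function name `adjusted_rand_index` and the example labels are illustrative assumptions, not taken from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same individuals."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    # Pairs grouped together in both partitions (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition (table margins).
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    max_index = (sum_rows + sum_cols) / 2         # value for perfect agreement
    if max_index == expected:                     # both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels: same grouping with switched labels, then a crossed grouping.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index does not depend on how clusters are labelled, which matters here because partitions obtained from different imputed data sets carry arbitrary labels.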

1.2 Instability pooling

See Tables 5 and 6.

Table 5 Instability of the clustering procedure under an MCAR mechanism: average within-instability (\(\bar{U}\)), average between-instability (B) and average total instability (T) over the \({S}=200\) generated data sets for various numbers of individuals (\({n}\)), correlations between variables (\(\rho \)) and proportions of missing values (\(\tau \))
Table 6 Instability of the clustering procedure under a MAR mechanism: average within-instability (\(\bar{U}\)), average between-instability (B) and average total instability (T) over the \({S}=200\) generated data sets for various numbers of individuals (\({n}\)), correlations between variables (\(\rho \)) and proportions of missing values (\(\tau \))
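To give a rough sense of what a between-imputation instability could measure, the sketch below averages a pairwise disagreement distance over the partitions obtained from the imputed data sets. This is an illustrative proxy under stated assumptions, not the article's definition of B, which is not shown in this preview; the names `pair_disagreement` and `between_disagreement` and the example partitions are hypothetical.

```python
from itertools import combinations


def pair_disagreement(labels_a, labels_b):
    """Proportion of pairs of individuals that are grouped together in one
    partition but separated in the other (invariant to label switching)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    disagree = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagree / len(pairs)


def between_disagreement(partitions):
    """Average pair disagreement over all pairs of partitions
    (assumed proxy for variability across imputed data sets)."""
    pairs = list(combinations(partitions, 2))
    return sum(pair_disagreement(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical partitions of six individuals, one per imputed data set.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as the first, labels switched
    [0, 0, 1, 1, 1, 1],  # individual 2 moves to the other cluster
]
between = between_disagreement(partitions)
```

A within-imputation counterpart would apply the same distance to partitions obtained from bootstrap resamples of a single imputed data set, in line with the bootstrap approach mentioned in the abstract.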

1.3 Influence of \({M}\)

1.3.1 Accuracy

See Figs. 7, 8 and 9.

Fig. 7 Accuracy of the clustering procedure according to \({M}\): adjusted Rand index over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)) generated under a MAR mechanism. Data sets are imputed by JM-DP varying by the number of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means clustering and consensus clustering is performed using NMF

Fig. 8 Accuracy of the clustering procedure according to \({M}\): adjusted Rand index over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)) generated under an MCAR mechanism. Data sets are imputed by FCS-RF varying by the number of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means clustering and consensus clustering is performed using NMF

Fig. 9 Accuracy of the clustering procedure according to \({M}\): adjusted Rand index over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)) generated under a MAR mechanism. Data sets are imputed by FCS-RF varying by the number of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means clustering and consensus clustering is performed using NMF
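The pooling step described in these captions (one partition per imputed data set, then a consensus partition) can be sketched as follows. Note that this sketch deliberately swaps in a much simpler consensus rule than the NMF or SAOM methods named in the article: it links two individuals when they share a cluster in more than half of the partitions and returns the connected components. The function names, the `threshold` parameter and the example partitions are all illustrative assumptions.

```python
from itertools import combinations


def co_association(partitions):
    """Matrix of the fraction of partitions in which each pair of
    individuals falls in the same cluster."""
    n, m = len(partitions[0]), len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_partition(partitions, threshold=0.5):
    """Pool partitions: link pairs co-clustered in more than `threshold`
    of the partitions, then return connected components as labels."""
    n = len(partitions[0])
    co = co_association(partitions)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Three hypothetical partitions of six individuals (M = 3 imputed data sets).
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping, labels switched
    [0, 0, 1, 1, 1, 1],  # individual 2 assigned differently
]
pooled = consensus_partition(partitions)
```

Working on co-membership rather than on raw labels sidesteps the label-switching problem across imputed data sets. Unlike NMF-based consensus, this simple rule does not let the number of clusters be fixed in advance.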

1.3.2 Instability

See Figs. 10, 11 and 12.

Fig. 10 Instability according to \({M}\): total instability (T) over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)) generated under a MAR mechanism. Data sets are imputed by JM-DP varying by the number of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means clustering and consensus clustering is performed using NMF

Fig. 11 Instability according to \({M}\): total instability (T) over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)) generated under an MCAR mechanism. Data sets are imputed by FCS-RF varying by the number of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means clustering and consensus clustering is performed using NMF

Fig. 12 Instability according to \({M}\): total instability (T) over the \({S}=200\) generated data sets varying by the number of individuals (\({n}\)), the correlation between variables (\(\rho \)) and the proportion of missing values (\(\tau \)) generated under a MAR mechanism. Data sets are imputed by FCS-RF varying by the number of imputed data sets (\({M}\)). For each data set, clustering is performed using k-means clustering and consensus clustering is performed using NMF

1.4 Application

See Table 7.

Table 7 Animals data set

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Audigier, V., Niang, N. Clustering with missing data: which equivalent for Rubin’s rules? Adv Data Anal Classif 17, 623–657 (2023). https://doi.org/10.1007/s11634-022-00519-1

  • DOI: https://doi.org/10.1007/s11634-022-00519-1
