Why Multiple Imputation is Indefensible for Handling Missing Data
In the world of data analysis, handling missing data is a challenge we all face. One popular solution is multiple imputation—a method that creates several versions of your data set, imputes missing values differently in each one, and combines the results. On the surface, this approach seems robust and reliable, but upon closer examination, significant issues arise that make it theoretically indefensible.
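For concreteness, here is a minimal sketch of what that workflow typically looks like in Python, using scikit-learn's IterativeImputer with sample_posterior=True so each pass draws imputed values stochastically. The toy data, the column names, and the choice of m = 5 imputations are all illustrative, not something taken from the post itself.

```python
# Illustrative multiple-imputation sketch (toy data, assumed names).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
x[rng.random(n) < 0.3] = np.nan          # make roughly 30% of x missing
data = pd.DataFrame({"x": x, "y": y})

m = 5                                     # number of imputed data sets
estimates, variances = [], []
for i in range(m):
    # sample_posterior=True makes each imputation a stochastic draw
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
    fit = sm.OLS(completed["y"], sm.add_constant(completed["x"])).fit()
    estimates.append(fit.params["x"])     # slope estimate from this imputed set
    variances.append(fit.bse["x"] ** 2)   # its squared standard error
```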
The Problem with Multiple Imputation
At its core, multiple imputation relies on a fundamental assumption: that the unknown parameters governing the missing data are drawn from some hypothetical "super-population." But here’s the catch—this super-population doesn’t exist in reality. It's a theoretical construct with no empirical grounding, meaning the process of stochastically imputing values is not based on objective reality.
The result? We are left with an average of estimates across the imputed data sets, but it’s unclear what this mean actually represents. It’s a mean of estimates drawn from hypothetical realities, none of which we can empirically verify. This lack of grounding in observable data is a serious problem for scientific rigor and interpretability.
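The averaging in question is usually done with Rubin's rules: the per-imputation estimates are averaged, and their within- and between-imputation variances are combined. Continuing the illustrative sketch above (same toy data and variable names):

```python
# Rubin's rules pooling of the per-imputation results from the sketch above.
q_bar = np.mean(estimates)                     # pooled point estimate
within = np.mean(variances)                    # average within-imputation variance
between = np.var(estimates, ddof=1)            # between-imputation variance
total_var = within + (1 + 1 / m) * between     # Rubin's total variance
print(f"pooled slope = {q_bar:.3f}, pooled SE = {np.sqrt(total_var):.3f}")
```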
Unfalsifiability and Interpretation
Another issue is that multiple imputation isn’t falsifiable. The imputation models are based on assumptions that we can’t test or validate against real-world data. This makes it impossible to investigate whether the models reflect the true data-generating process. We can't assess their accuracy, let alone correct them if they're wrong. In science, falsifiability is crucial—our methods must be open to scrutiny and potential disproof. Multiple imputation simply doesn't allow for that.
An Alternative: Single Stochastic Imputation and Sensitivity Analysis
What can we use instead? A more defensible approach is single stochastic imputation combined with deterministic sensitivity analysis. Here’s why:
Single stochastic imputation involves imputing missing values just once, based on a single model fit directly to the observed data. That model encodes one explicit assumption about what the missing values would have looked like had they been observed. It’s transparent, interpretable, and, most importantly, connected to reality.
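A minimal sketch of what this could look like, continuing the same toy data as above: one regression of x on y is fit to the complete cases, and each missing value is drawn exactly once from that fitted model plus residual noise. The model choice here is an assumption made explicit for illustration, not a prescription.

```python
# Single stochastic imputation sketch (one stated model, one imputed data set).
observed = data.dropna()                                    # complete cases only
imp_model = sm.OLS(observed["x"], sm.add_constant(observed["y"])).fit()

missing = data["x"].isna()
pred = imp_model.predict(sm.add_constant(data.loc[missing, "y"]))
noise = rng.normal(scale=np.sqrt(imp_model.scale), size=missing.sum())

imputed_once = data.copy()
imputed_once.loc[missing, "x"] = pred + noise               # impute each value exactly once
single_fit = sm.OLS(imputed_once["y"], sm.add_constant(imputed_once["x"])).fit()
print(f"slope under single stochastic imputation = {single_fit.params['x']:.3f}")
```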
Deterministic sensitivity analysis allows us to explore different missing data models and compare the results. Instead of averaging across multiple hypothetical worlds, we can see how our analysis changes under different reasonable assumptions. This gives us a clearer picture of uncertainty while maintaining transparency.
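One simple way to make that comparison concrete is a delta-adjustment style sensitivity analysis: re-estimate the quantity of interest after shifting the singly imputed values by a few explicit amounts and lay the results side by side. Continuing the sketch above, with purely illustrative shift sizes:

```python
# Deterministic sensitivity analysis sketch: compare estimates under
# several explicit missing-data assumptions instead of averaging them.
results = {}
sd_x = observed["x"].std()                    # scale for the shifts
for delta in (-1.0, -0.5, 0.0, 0.5, 1.0):     # shift imputed values by delta SDs
    adjusted = imputed_once.copy()
    adjusted.loc[missing, "x"] += delta * sd_x
    fit = sm.OLS(adjusted["y"], sm.add_constant(adjusted["x"])).fit()
    results[delta] = fit.params["x"]

for delta, slope in results.items():
    print(f"delta = {delta:+.1f} SD  ->  slope = {slope:.3f}")
```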
Keeping Science Falsifiable
By embracing single stochastic imputation and sensitivity analysis, we keep our methods grounded in empirical reality, preserving the ability to falsify and revise our models. In science, it’s not enough for results to be plausible—they must be true. And the only way to ensure that is by relying on methods that keep science falsifiable and transparent.
Let’s move beyond multiple imputation and adopt methods that give us results we can truly trust.
I hope this sparks some valuable conversations about how we handle missing data and the methodologies we trust in statistical analysis. What are your thoughts? Do you use multiple imputation, or are you open to alternatives?
#DataScience #StatisticalAnalysis #MissingData #Imputation #DataAnalysis #ResearchMethods
Comment from a reader (Chief Data Scientist at EAF LLC):
Can you simplify the point you are making here? How does any model built on the remaining data (observations) differ, as long as we don’t use the target variable (which is tantamount to information leakage)?