Why Multiple Imputation is Indefensible for Handling Missing Data
In the world of data analysis, handling missing data is a challenge we all face. One popular solution is multiple imputation—a method that creates several versions of your data set, imputes missing values differently in each one, and combines the results. On the surface, this approach seems robust and reliable, but upon closer examination, significant issues arise that make it theoretically indefensible.
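For concreteness, here is a minimal sketch of what that workflow typically looks like in Python, using scikit-learn's IterativeImputer with sample_posterior=True so each pass draws imputed values stochastically. The toy data, the column names, and the choice of m = 5 imputations are all illustrative, not something taken from the post itself.

```python
# Illustrative multiple-imputation sketch (toy data, assumed names).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
x[rng.random(n) < 0.3] = np.nan          # make roughly 30% of x missing
data = pd.DataFrame({"x": x, "y": y})

m = 5                                     # number of imputed data sets
estimates, variances = [], []
for i in range(m):
    # sample_posterior=True makes each imputation a stochastic draw
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
    fit = sm.OLS(completed["y"], sm.add_constant(completed["x"])).fit()
    estimates.append(fit.params["x"])     # slope estimate from this imputed set
    variances.append(fit.bse["x"] ** 2)   # its squared standard error
```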
The Problem with Multiple Imputation
At its core, multiple imputation relies on a fundamental assumption: that the unknown parameters governing the missing data are drawn from some hypothetical "super-population." But here’s the catch—this super-population doesn’t exist in reality. It's a theoretical construct with no empirical grounding, meaning the process of stochastically imputing values is not based on objective reality.
The result? We are left with an average of estimates across the imputed data sets, but it’s unclear what this mean actually represents. It’s a mean of estimates drawn from hypothetical realities, none of which we can empirically verify. This lack of grounding in observable data is a serious problem for scientific rigor and interpretability.
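The averaging in question is usually done with Rubin's rules: the per-imputation estimates are averaged, and their within- and between-imputation variances are combined. Continuing the illustrative sketch above (same toy data and variable names):

```python
# Rubin's rules pooling of the per-imputation results from the sketch above.
q_bar = np.mean(estimates)                     # pooled point estimate
within = np.mean(variances)                    # average within-imputation variance
between = np.var(estimates, ddof=1)            # between-imputation variance
total_var = within + (1 + 1 / m) * between     # Rubin's total variance
print(f"pooled slope = {q_bar:.3f}, pooled SE = {np.sqrt(total_var):.3f}")
```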
Unfalsifiability and Interpretation
Another issue is that multiple imputation isn’t falsifiable. The imputation models are based on assumptions that we can’t test or validate against real-world data. This makes it impossible to investigate whether the models reflect the true data-generating process. We can't assess their accuracy, let alone correct them if they're wrong. In science, falsifiability is crucial—our methods must be open to scrutiny and potential disproof. Multiple imputation simply doesn't allow for that.
An Alternative: Single Stochastic Imputation and Sensitivity Analysis
What can we use instead? A more defensible approach is single stochastic imputation combined with deterministic sensitivity analysis. Here’s why:
Single stochastic imputation involves imputing missing values just once, based on a single model fit directly to the observed data. That model encodes one explicit assumption about what the missing values would have looked like had they been observed. It’s transparent, interpretable, and, most importantly, connected to reality.
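A minimal sketch of what this could look like, continuing the same toy data as above: one regression of x on y is fit to the complete cases, and each missing value is drawn exactly once from that fitted model plus residual noise. The model choice here is an assumption made explicit for illustration, not a prescription.

```python
# Single stochastic imputation sketch (one stated model, one imputed data set).
observed = data.dropna()                                    # complete cases only
imp_model = sm.OLS(observed["x"], sm.add_constant(observed["y"])).fit()

missing = data["x"].isna()
pred = imp_model.predict(sm.add_constant(data.loc[missing, "y"]))
noise = rng.normal(scale=np.sqrt(imp_model.scale), size=missing.sum())

imputed_once = data.copy()
imputed_once.loc[missing, "x"] = pred + noise               # impute each value exactly once
single_fit = sm.OLS(imputed_once["y"], sm.add_constant(imputed_once["x"])).fit()
print(f"slope under single stochastic imputation = {single_fit.params['x']:.3f}")
```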
Deterministic sensitivity analysis allows us to explore different missing data models and compare the results. Instead of averaging across multiple hypothetical worlds, we can see how our analysis changes under different reasonable assumptions. This gives us a clearer picture of uncertainty while maintaining transparency.
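One simple way to make that comparison concrete is a delta-adjustment style sensitivity analysis: re-estimate the quantity of interest after shifting the singly imputed values by a few explicit amounts and lay the results side by side. Continuing the sketch above, with purely illustrative shift sizes:

```python
# Deterministic sensitivity analysis sketch: compare estimates under
# several explicit missing-data assumptions instead of averaging them.
results = {}
sd_x = observed["x"].std()                    # scale for the shifts
for delta in (-1.0, -0.5, 0.0, 0.5, 1.0):     # shift imputed values by delta SDs
    adjusted = imputed_once.copy()
    adjusted.loc[missing, "x"] += delta * sd_x
    fit = sm.OLS(adjusted["y"], sm.add_constant(adjusted["x"])).fit()
    results[delta] = fit.params["x"]

for delta, slope in results.items():
    print(f"delta = {delta:+.1f} SD  ->  slope = {slope:.3f}")
```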
Keeping Science Falsifiable
By embracing single stochastic imputation and sensitivity analysis, we keep our methods grounded in empirical reality, preserving the ability to falsify and revise our models. In science, it’s not enough for results to be plausible—they must be true. And the only way to ensure that is by relying on methods that keep science falsifiable and transparent.
Let’s move beyond multiple imputation and adopt methods that give us results we can truly trust.
I hope this sparks some valuable conversations about how we handle missing data and the methodologies we trust in statistical analysis. What are your thoughts? Do you use multiple imputation, or are you open to alternatives?
#DataScience #StatisticalAnalysis #MissingData #Imputation #DataAnalysis #ResearchMethods
Comment from a reader (Chief Data Scientist at EAF LLC):
Can you simplify the point you are making here? How does any model built on the remaining data (observations) differ, as long as we don’t use the target variable (which is tantamount to information leakage)?