Linear regression (LR) and generalized linear models (GLMs) are two great tools in data science, but understanding how they differ is key to applying them well. While both model relationships between variables, they differ in their assumptions, formulations, and handling of data variability.

LR assumes a continuous, normally distributed response variable with constant variance. GLMs relax these assumptions, accommodating many response types and distributions through link functions.

Adjustment to data variability: LR struggles with heteroscedasticity and non-normal errors, while GLMs handle them through appropriate error distributions and link functions. GLMs also specialize in modeling categorical and count data, which LR handles poorly.

Both offer distinct advantages in statistical modeling. By understanding their differences and how each adjusts to data variability, analysts can make more informed decisions and draw reliable conclusions from their data.

- You can check this post from Daily Dose of Data Science by Avi Chawla for a deeper understanding and a great reading resource!!
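Here is a minimal sketch of the count-data point above (assuming NumPy and statsmodels are available; the data and coefficients are simulated purely for illustration): a Poisson GLM with a log link versus plain LR on the same counts.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)

# Simulate counts whose conditional mean is exp(0.5 + 0.8*x).
y = rng.poisson(np.exp(0.5 + 0.8 * x))

lr = sm.OLS(y, X).fit()                                 # assumes Gaussian errors, constant variance
glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # log link, Poisson variance

print(lr.params)   # a linear fit can even predict negative counts
print(glm.params)  # recovers the coefficients on the log scale
```

The GLM respects the structure of the outcome (non-negative integers, variance tied to the mean), which is exactly what the link function and error distribution buy you.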
Actually, for the usual OLS regression, heteroskedasticity can be easily dealt with by simply using heteroskedasticity-robust standard errors. Another way is to use feasible generalized least squares (FGLS, which is different from GLM!) instead. Non-normality of errors is also usually not an issue for statistical inference if you have a large sample size, as your estimators will still be asymptotically normally distributed (although it is true that GLMs can give a better fit, especially if the outcome variable is discrete). An additional fun fact is that OLS regression is just a special case of a GLM where the link function is the identity function and where the conditional distribution of the outcome variable (conditioned on the independent variables) is a normal distribution.
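Both points lend themselves to a quick demonstration. Below is a hedged sketch (assuming statsmodels; the heteroskedastic data is simulated for illustration) of robust standard errors and of OLS reproduced as a Gaussian GLM with an identity link.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(scale=1 + np.abs(x))  # heteroskedastic noise

ols = sm.OLS(y, X).fit()
robust = sm.OLS(y, X).fit(cov_type="HC3")  # heteroskedasticity-robust SEs
print(ols.bse, robust.bse)                 # same coefficients, different standard errors

# OLS as a special case of a GLM: Gaussian family, identity link (the default).
glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
print(np.allclose(ols.params, glm.params))  # True: identical point estimates
```

The robust fit leaves the coefficients untouched and only corrects the standard errors, while the Gaussian GLM returns the same estimates as OLS, confirming the special-case relationship.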
Very interesting, haven’t seen this breakdown before
great to have a visual version too!
Wonderful breakdown, Isaac Pacheco Rojas. Glad you loved the article. This is how we can break down the assumptions of LR (depicted in the image):
- First, it assumes that the conditional distribution of Y given X is a Gaussian.
- Next, it assumes a very specific form for the mean of that Gaussian: the mean should always be a linear combination of the features (or predictors).
- Lastly, it assumes a constant variance for the conditional distribution P(Y|X) across all levels of X.
Generalized linear models (GLMs) relax all these assumptions, which makes them more adaptable to real-world datasets.
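A small simulation sketch of those three assumptions (the coefficient values and variable names here are illustrative, not from the article): the conditional distribution P(Y|X) is Gaussian, its mean is linear in the features, and its variance is constant across all levels of X.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(-2, 2, size=n)

beta0, beta1, sigma = 1.0, 3.0, 0.5
mu = beta0 + beta1 * x               # mean is a linear combination of the features
y = rng.normal(loc=mu, scale=sigma)  # Gaussian P(Y|X) with constant variance sigma**2

res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.params)  # close to (1.0, 3.0) because all three assumptions hold
```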