One of the most common questions I get is "𝐌𝐲 𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐢𝐯𝐞 𝐦𝐨𝐝𝐞𝐥 𝐢𝐬𝐧'𝐭 𝐰𝐨𝐫𝐤𝐢𝐧𝐠 𝐰𝐞𝐥𝐥 𝐞𝐧𝐨𝐮𝐠𝐡...𝐰𝐡𝐚𝐭 𝐬𝐡𝐨𝐮𝐥𝐝 𝐈 𝐝𝐨?" If model performance is disappointing, there are three main levers we can pull to try to improve it.

🔷 The first lever is changing the data the model uses. We can add more features or transform the features we've already included. In my experience, this is the most powerful of the three.

🔷 Another lever is changing the type of model or the type of feature selection. If a regression model isn't working well, we can try a decision tree, for example. We can also use a penalized regression model that performs feature selection automatically during fitting.

🔷 The third lever is tuning the hyperparameters. A hyperparameter is like a setting knob on a model: it controls how the model learns, so different hyperparameter values can produce models with very different results.

Any combination of these three levers may be used to improve model performance. Depending on what data is accessible, it may not be feasible to add more features, so the data scientist may have to rely on model selection and hyperparameter tuning to improve the quality of predictions.
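Here's a minimal sketch of what pulling all three levers can look like with scikit-learn. The synthetic dataset, pipeline steps, and parameter grid are illustrative choices, not a recipe:

```python
# Lever 1: change the data (polynomial/interaction terms, rescaling).
# Lever 2: change the model (penalized Lasso regression, which zeroes out
#          unhelpful coefficients -- feature selection during fitting).
# Lever 3: tune hyperparameters (polynomial degree, penalty strength).
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for whatever your real features are.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),   # lever 1
    ("scale", StandardScaler()),                         # lever 1
    ("model", Lasso(max_iter=10000)),                     # lever 2
])

grid = GridSearchCV(                                      # lever 3
    pipe,
    param_grid={"poly__degree": [1, 2], "model__alpha": [0.01, 0.1, 1.0]},
    cv=5,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("held-out R^2:", grid.score(X_test, y_test))

# Swapping the model family entirely (e.g. a decision tree) is just a
# different estimator -- another flavor of lever 2.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)
print("tree held-out R^2:", tree.score(X_test, y_test))
```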
“Working well” can mean different things to different people. I was working on a predictive model of a surgical complication with a rate of a few percent. Using calibration curves, I was able to show that predicted and actual rates agreed well up to 20-25 percent, but not beyond. While I was initially disappointed, collaborators were delighted - the model could identify who was at relatively high risk!
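For anyone who hasn't built one, a calibration curve just bins predicted probabilities and compares each bin's mean prediction to the observed event rate. A minimal sketch, assuming a binary outcome and scikit-learn; the synthetic imbalanced data and logistic model are placeholders, not the surgical model described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

# Imbalanced outcome (~5% positives) to mimic a low-rate complication.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# Quantile bins: compare observed event rate to mean predicted risk per bin.
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10,
                                        strategy="quantile")
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f}  observed {f:.2f}")
```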
Agree with #1 especially. Getting the data right - even just cleaner - can make a big difference.