Principal Component Analysis - Intro - Towards Data Science
Too many variables? Should you use every available variable to build a model?
To handle the “curse of dimensionality” and avoid issues like over-fitting in
high-dimensional space, methods like Principal Component Analysis (PCA) are used.
https://towardsdatascience.com/principal-component-analysis-intro-61f236064b38 1/4
9/7/2019 Principal Component Analysis- Intro - Towards Data Science
Start by centering the predictors: subtract each predictor's mean from each of its data
points. If the original predictors are on different scales, also divide each one by its
standard deviation; otherwise the predictor with the largest scale will dominate the
variance. The result will look like table 2, with a mean of zero.
Normalized Data
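To make the centering step concrete, here is a minimal NumPy sketch. The array `X` is made-up illustration data, not the table from this post, and the standardizing line is an optional extra for predictors on different scales.

```python
import numpy as np

# Hypothetical data: 5 observations (rows) of 2 predictors (columns).
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0],
              [8.0, 50.0],
              [10.0, 40.0]])

# Centering: subtract each column's mean so every predictor has mean zero.
X_centered = X - X.mean(axis=0)

# Optional standardizing: also divide by each column's sample standard
# deviation so every predictor has unit variance.
X_standardized = X_centered / X.std(axis=0, ddof=1)

print(X_centered.mean(axis=0))              # column means after centering
print(X_standardized.std(axis=0, ddof=1))   # column standard deviations
```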
Next, calculate the covariance matrix for the data, which measures how two
predictors move together. Covariance is measured between two predictors at a time, so
if you have 3-dimensional data (x, x1, x2), measure the covariance between each pair:
(x, x1), (x, x2), and (x1, x2). For reference, the covariance formula is:
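A standard form of the sample covariance between two predictors x and y, assuming the usual n - 1 denominator (some texts divide by n instead), is:

```latex
\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Here n is the number of observations, and \bar{x} and \bar{y} are the means of the two predictors. The covariance of a predictor with itself is its variance.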
Covariance Matrix
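As a rough sketch of this step, the covariance matrix can be built directly from the centered data and checked against NumPy's `np.cov`. The array `X` is made-up illustration data, and the n - 1 denominator is an assumption (it is what `np.cov` uses by default).

```python
import numpy as np

# Hypothetical data: 5 observations (rows) of 2 predictors (columns).
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0],
              [8.0, 50.0],
              [10.0, 40.0]])

n = X.shape[0]
X_centered = X - X.mean(axis=0)

# Entry (i, j) is the covariance between predictor i and predictor j;
# the diagonal holds each predictor's variance.
cov_manual = X_centered.T @ X_centered / (n - 1)

# The same matrix from NumPy; rowvar=False means columns are the variables.
cov_numpy = np.cov(X, rowvar=False)

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```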
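Now, calculate the eigenvalues and eigenvectors of the above matrix. This helps in finding
underlying patterns in the data: the eigenvectors give the directions along which the data
varies most, and the eigenvalues give the amount of variance along each direction. In our
case it would be approximately: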
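For readers who want to try this step themselves, here is a small sketch using `np.linalg.eigh`, which is designed for symmetric matrices such as a covariance matrix. It runs on made-up data, so its output is not the values from this post's example.

```python
import numpy as np

# Hypothetical data: 5 observations (rows) of 2 predictors (columns).
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0],
              [8.0, 50.0],
              [10.0, 40.0]])

cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)    # variance along each new axis
print(eigenvectors)   # column k is the direction of new axis k
```

Note that the sign of each eigenvector is arbitrary, so another library may return the same directions with flipped signs.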
We are almost there :). Perform the reorientation. To convert the data onto the new axes,
multiply the mean-centered data by the eigenvectors, which give the directions of the new
axes. Note that you can choose to leave out the eigenvector with the smaller eigenvalue or
use both. Also, decide how many components to keep based on how many are needed to
account for 95% or more of the variance.
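The reorientation and the 95% rule can be sketched as follows. The data is made up, and names such as `scores`, `explained`, and `k` are illustrative choices, not taken from this post.

```python
import numpy as np

# Hypothetical data: 6 observations (rows) of 3 predictors (columns).
X = np.array([[2.0, 10.0, 1.0],
              [4.0, 30.0, 3.0],
              [6.0, 20.0, 2.0],
              [8.0, 50.0, 6.0],
              [10.0, 40.0, 4.0],
              [12.0, 60.0, 5.0]])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigen-decomposition, sorted so the largest eigenvalue comes first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Reorientation: project the centered data onto the new axes.
scores = X_centered @ eigenvectors

# Share of total variance carried by each component, and the running total.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

# Smallest number of components whose running total reaches 95%.
k = int(np.searchsorted(cumulative, 0.95) + 1)
reduced = scores[:, :k]

print(cumulative)
print(k, reduced.shape)
```

A useful sanity check is that the covariance matrix of `scores` is diagonal, with the eigenvalues on the diagonal: the new axes are uncorrelated with each other.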
Finally, the scores calculated in the step above can be plotted and fed into the
predictive model. Plots give us a sense of how close or highly correlated two variables
are. Instead of plotting the original data on the X and Y axes, which doesn't tell us much
about how points are related to each other, we plot the transformed data (using the
eigenvectors), which reveals patterns and shows the relationships between points.
End Note: It is easy to confuse PCA with Factor Analysis, but there is a conceptual
difference between the two methods. I will go into the details of Factor Analysis and
how it differs from PCA in my next post. Stay tuned.