
Principal Component Analysis - Intro


Variable Reduction Technique
Anuja Nagpal
Nov 21, 2017 · 3 min read

Too many variables? Should you be using all of them to generate a model?

To handle the “curse of dimensionality” and avoid issues like over-fitting in
high-dimensional space, methods like Principal Component Analysis (PCA) are used.

PCA is a method for reducing the number of variables in your data by extracting the
important ones from a large pool. It reduces the dimension of your data with the aim of
retaining as much information as possible. In other words, the method combines highly
correlated variables into a smaller set of artificial variables, called “principal
components,” that account for most of the variance in the data.

Let’s dive in to understand how PCA is implemented behind the scenes.

Start by normalizing the predictors, subtracting the mean from each data point. It is
important to normalize the predictors because the original predictors can be on
different scales, and a predictor on a larger scale would contribute disproportionately
to the variance. The result will look like the table below, with each predictor having a
mean of zero.

[Table: Normalized Data]
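A minimal sketch of this step in Python with NumPy; the small two-predictor dataset below is a made-up stand-in for the article’s table:

```python
import numpy as np

# Hypothetical data: two correlated predictors, one row per observation.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7]])

# Center each predictor by subtracting its column mean.
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))  # each column now has mean ~0
```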

Next, calculate the covariance matrix for the data, which measures how pairs of
predictors move together. Covariance is measured between two predictors at a time, so if
you have 3-dimensional data (x, x1, x2), you measure the covariance between (x, x1),
(x, x2), and (x1, x2). For reference, the covariance of two predictors X and Y over n
observations is:

cov(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

In our case, the covariance matrix would look like this:

[Figure: Covariance Matrix]
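Continuing the NumPy sketch, this matrix can be computed directly:

```python
# Covariance matrix of the centered predictors. rowvar=False means the
# columns are the variables; np.cov uses the n - 1 denominator shown above.
cov_matrix = np.cov(X_centered, rowvar=False)
print(cov_matrix)  # symmetric 2x2; off-diagonal entries are cov(x, y)
```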

Now, calculate the eigenvalues and eigenvectors of the above matrix. This helps in
finding the underlying patterns in the data. In our case they would be approximately:

[Figure: Eigenvalues and Eigenvectors]
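In the sketch, since a covariance matrix is symmetric, NumPy’s np.linalg.eigh is the natural routine for this step:

```python
# Eigendecomposition of the symmetric covariance matrix. eigh returns
# eigenvalues in ascending order, with eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reorder so the component with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
```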

We are almost there :). Perform the reorientation. To convert the data onto the new
axes, multiply the original (centered) data by the eigenvectors, which give the
directions of the new axes. Note that you can choose to leave out the eigenvectors with
smaller eigenvalues or keep all of them. A common rule is to keep however many
components together account for 95% or more of the variance. Both choices are shown in
the sketch below.
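Continuing the sketch, the 95% rule and the projection look like this:

```python
# Fraction of total variance each component explains, largest first.
explained = eigenvalues / eigenvalues.sum()

# Smallest k whose cumulative explained variance reaches 95%.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)

# Reorient: project the centered data onto the top-k eigenvectors.
# Each row of `scores` is one observation expressed in the new axes.
scores = X_centered @ eigenvectors[:, :k]
```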

Finally, the scores calculated in the step above can be plotted and fed into a
predictive model. The plots give us a sense of how close, and how highly correlated, two
variables are. Instead of plotting the original data on the X and Y axes, which doesn’t
tell us much about how points are related to each other, we plot the transformed data
(projected onto the eigenvectors), which surfaces patterns and shows the relationships
between points.
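For completeness, scikit-learn’s PCA wraps the whole walkthrough (centering, decomposition, projection) in a few lines; here is a sketch of how the scores could be plotted or handed to a model:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# PCA centers the data itself; n_components=2 keeps both components here.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance accounted for per component

# Scatter the transformed data; patterns are easier to see in these axes.
plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```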

End Note: It is easy to confuse PCA with Factor Analysis, but there is a conceptual
difference between the two methods. I will be going into the details of Factor Analysis
and how it differs from PCA in my next post… stay tuned.

