PCA is a technique mainly used for three goals:
- Dealing with correlated features (a weakness of Linear Regression).
- Reducing dimensionality (the number of features) without losing too much information.
- “Seeing” a high-dimensional dataset in a 2D scatter plot.
It’s not difficult to find articles about it all over the internet, but I struggled to get an intuitive understanding of PCA. Almost everything I found tries to explain the linear algebra behind it but doesn’t give much insight (at least to my non-mathematical mind).
So, here I try to understand what it does and give a non-mathematical explanation. Of course, I’ll commit several sins in doing that, and I’ll also tell some “half-truths”, but bear with me; it’s going to be useful.
So, what does PCA do?
PCA rotates, flips, and/or scales the values in the samples, moving the biggest differences between them to the “principal axes”.
In statistical parlance, “difference between samples” is called variance.
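To make “difference between samples” concrete, here is a tiny NumPy sketch of variance as spread (the toy numbers are mine, not taken from any of the charts):

```python
import numpy as np

# Two toy features: x spreads widely, y barely varies.
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([2.0, 2.1, 1.9, 2.0, 2.0])

print(np.var(x))  # large spread  -> large variance (8.0)
print(np.var(y))  # tiny spread   -> tiny variance  (0.004)
```

An axis with large variance carries a lot of “difference between samples”; an axis with near-zero variance barely distinguishes them.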
And we talked about “dimensionality reduction”, right? That means I can see a 3D shape in a 2D plot. It seems magical, but it’s just a side effect of moving the biggest differences to some axes.
We reduce dimensionality with PCA by discarding axes. The trick is to discard the least relevant ones.
Enough blah, blah, blah. Let’s see some examples.
A simple line
Let’s see what happens to a line under a PCA transformation.
Now, imagine the line is our data with 2 features. After applying PCA, we can use the x-axis only, because all points on the y-axis have value zero. The y-axis becomes useless.
PCA rotates a line to flatten it onto the x-axis.
That doesn’t mean we can discard a feature in the original dataset. See in the original chart on the left-hand side that the points vary as much along y as along the x-axis. Only after we transform the data can we discard the new, useless y.
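The flattening above can be reproduced in a few lines using scikit-learn’s `PCA` (an assumed library choice on my part; the toy diagonal line is also mine):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: points on the diagonal line y = x,
# so both features vary by the same amount.
pts = np.column_stack([np.arange(10.0), np.arange(10.0)])

pca = PCA(n_components=2)
transformed = pca.fit_transform(pts)

# After the rotation, all the variation lies along the first axis;
# the second coordinate is (numerically) zero for every point,
# so the transformed y-axis can be dropped without losing anything.
print(np.allclose(transformed[:, 1], 0.0))  # True
```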
Sometimes we lose some information
However, we cannot simply discard the “useless” axis without losing some information.
In the example below, we displaced some points. We can see that the line was rotated and flipped. We essentially keep the same information and can still discard the y-axis; however, in doing that, we are going to lose some information.
Does it matter? It depends on the shape of our data, but if the dataset has high dimensionality, it probably won’t be a problem.
The nice thing about PCA is that it gives us enough information to see how much information we are losing.
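In scikit-learn, that information is exposed as `explained_variance_ratio_`. A small sketch with a noisy line I made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# A noisy line: most of the spread lies along one direction,
# with a little displacement off it.
x = rng.uniform(0, 10, 200)
pts = np.column_stack([x, x + rng.normal(0, 0.5, 200)])

pca = PCA(n_components=2).fit(pts)
# Fraction of the total variance kept by each principal axis.
print(pca.explained_variance_ratio_)
# The first axis keeps almost everything, so discarding the second
# loses only a small, known fraction of the information.
```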
On a vertical line, it flattens again
The point of PCA is to order the axes so that the first one holds the biggest differences between the points, the second a little less, and so on. So, for a vertical line, it rotates the points so that x (our first axis) contains the biggest differences (which were on the y-axis before).
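The vertical-line case can be checked directly (again with scikit-learn and a made-up line):

```python
import numpy as np
from sklearn.decomposition import PCA

# A vertical line: x is constant, all the spread is in y.
pts = np.column_stack([np.full(10, 3.0), np.arange(10.0)])

t = PCA(n_components=2).fit_transform(pts)

# PCA rotates the points: the spread that used to be in y now lies
# entirely along the first axis of the transformed data, and the
# second axis is flat.
print(np.var(t[:, 0]))            # all the variance moved here
print(np.allclose(t[:, 1], 0.0))  # True
```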
And why is this important?
If we are going to classify the data, the differences (variance) between the points help us make a clean separation. If points are too close together, it’s harder to separate them. Moving the biggest differences to certain axes allows us to discard the ones that don’t help much to differentiate the samples.
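To see how that helps separation, here is a hypothetical two-class example (the clusters are invented for illustration): after PCA, the gap between the classes lands on the first axis, so they remain separable even if we keep only that axis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two tight clusters that differ along the diagonal direction.
a = rng.normal([0.0, 0.0], 0.3, (50, 2))
b = rng.normal([4.0, 4.0], 0.3, (50, 2))
pts = np.vstack([a, b])

t = PCA(n_components=2).fit_transform(pts)
first = t[:, 0]

# The two groups occupy disjoint ranges on the first axis alone
# (the `or` covers the arbitrary sign of the principal axis).
separated = (first[:50].max() < first[50:].min()
             or first[:50].min() > first[50:].max())
print(separated)  # True
```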
It always depends on the shape of your data
In the examples below, we see that splitting our previous vertical line gives very different results.
Again, if you have tons of features (high dimensionality), it won’t be a problem to discard the less relevant axes. But each shape is a different story: sometimes we have to keep a good number of them, sometimes we can discard a lot.
Non-linear data are interesting
If we try PCA on non-linear data, we can see more clearly that it rotates, flips, and rescales the points.
It’s not easy to see the differences between the points in the original line; it seems to be a straight line with some noise. That’s because the points on the y-axis don’t vary much compared to the x-axis. When we rescale the y-axis, we can finally see the true shape.
The points at the beginning of the line are more distinguishable along the y-axis, and the points at the end are more distinguishable along the x-axis. After PCA, the y-axis is not completely useless, but if we remove it, we’ll have the points on a flat line, fairly well separated along the x-axis.
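A shallow curve of my own invention shows the same effect in numbers: the second axis is not exactly zero, but it carries only a tiny, measurable share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# A shallow non-linear curve: y varies far less than x, so at equal
# axis scales it looks almost like a straight line with some noise.
x = np.linspace(0, 10, 200)
y = 0.05 * np.sin(x)
pts = np.column_stack([x, y])

ratio = PCA(n_components=2).fit(pts).explained_variance_ratio_
# The first axis dominates, but the second is not completely useless:
# its share is small yet strictly greater than zero.
print(ratio)
```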
I hope that helped you demystify PCA a little bit and get some intuition about what it does. It surely helped me, and now I feel I could explain the linear algebra behind it (maybe in a future post, or not).
Finally, I find it very useful to try a technique on simple, artificial data to get a better understanding of it without diving into the details yet. I encourage you to do that with anything new you are studying :)
Here is a link to the notebook where I ran the experiments. Feel free to add comments to it.