I was trying to get a better understanding of the “C” regularization parameter in an SVM with a Gaussian (RBF) kernel. When I tried some low values, I got VERY weird results (wait for it…)
Here is an animated gif with the problem:
The gif shows the behavior of the SVC using the “predict” and “decision_function” methods. The background is the prediction and the points are the training data. So, what’s wrong?
- For low values of “C”, everything is predicted as the same class. Then suddenly, after a certain value, the prediction starts to work correctly.
- Same for the “decision_function” method. The documentation says it returns the distance to the separating hyperplane, but at low values of “C” it varies back and forth, flickering with some sudden changes. Why? I’d love to know; if you can explain it, please send me a “hello!” I suspect the flickering is due to some lack of normalization or scaling.
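To make the setup concrete, here is a minimal sketch of the kind of sweep behind the gif. The dataset is an assumption on my part (I use `make_moons` as a stand-in for the post’s data), as is the grid and the C range:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Hypothetical stand-in dataset; the post's actual data is not shown here.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Evaluate predict and decision_function on a background grid,
# sweeping C over several orders of magnitude.
xx, yy = np.meshgrid(np.linspace(-2, 3, 100), np.linspace(-2, 2, 100))
grid = np.c_[xx.ravel(), yy.ravel()]

for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    pred = clf.predict(grid)            # hard class labels for the background
    dist = clf.decision_function(grid)  # signed distance to the hyperplane
    print(C, np.unique(pred), dist.min().round(3), dist.max().round(3))
```

Printing the range of `decision_function` per frame is a quick way to see whether the flickering comes from the values themselves jumping around or only from the color scaling of the plot.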
Now, let’s see the behavior of “predict_proba”.
The left-hand side uses “predict_proba” ≥ 0.5 to give a hard prediction, and the right-hand side uses the raw value of “predict_proba” to show more ranges.
The behavior seems less erratic. As “C” increases, the regions smoothly get more complicated. No sudden changes, no flickering. Also, there is a large range of C at the beginning that yields the same result. This is intriguing; would it behave the same for other datasets?
The docs say “predict_proba” uses Platt scaling and warn you about inconsistencies between it and the “predict” method. They also recommend avoiding it with small samples. And finally, they advise against its use because Platt scaling has some theoretical issues.
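For reference, `predict_proba` on `SVC` requires `probability=True`, which fits the Platt-scaling sigmoid on the decision values via an internal cross-validation; that extra fit is exactly why it can disagree with `predict`. A small sketch (again using a hypothetical `make_moons` stand-in):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# probability=True fits Platt scaling (a sigmoid over the decision values)
# using internal cross-validation, so its outputs are not derived directly
# from the same boundary that predict() uses.
clf = SVC(kernel="rbf", C=1.0, probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X)                # shape (n_samples, 2), rows sum to 1
hard = clf.predict(X)                       # plain SVM decision
thresh = (proba[:, 1] >= 0.5).astype(int)   # thresholded probability
print("agreement between predict and proba>=0.5:", np.mean(hard == thresh))
```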
So, for this specific example, let’s compare “predict” vs “predict_proba ≥ 0.5”:
Honestly, except for the low values of “C”, they are really similar: not equal, but really, really similar.
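That similarity can be quantified rather than eyeballed: measure, per value of C, the fraction of background points where the two agree. A sketch under the same stand-in-dataset assumption as above:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 80), np.linspace(-2, 2, 80))
grid = np.c_[xx.ravel(), yy.ravel()]

# Fraction of background points where predict() and predict_proba >= 0.5
# give the same label, as C grows.
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", C=C, probability=True, random_state=0).fit(X, y)
    hard = clf.predict(grid)
    soft = (clf.predict_proba(grid)[:, 1] >= 0.5).astype(int)
    print(f"C={C}: agreement {np.mean(hard == soft):.3f}")
```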
What’s the conclusion? Well…
- It’s nice to see the effect of the “C” parameter. I’m going to post about it in more detail later, I guess.
- The behavior of predict and decision_function is really odd. Maybe it’s something about this dataset; I need to investigate further. But it makes me think I should take extra care when using them, especially when I’m trying to reduce overfitting.
- The predict_proba method has a very consistent behavior, and I intuitively like that. Now I need to understand the theoretical issues with Platt scaling.
- Using predict and predict_proba ≥ 0.5 seems to make no significant difference, except for some lower values of C.
There is a long to-do list ahead. But I think the best way to understand the final effect of C and the strange behavior above is to try the same with synthetic datasets. That is, I’ll create several datasets with artificial points and repeat the experiment to see if I can figure out some pattern.
If you want to reproduce or extend it, here is the code:
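The original snippet isn’t included in this copy of the post, so here is a hedged sketch of how the frame sweep could be reproduced. The dataset, grid, C range, and file names are all my assumptions, not the post’s:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; frames are saved to disk
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Hypothetical stand-in for the post's dataset.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 60), np.linspace(-2, 2, 60))
grid = np.c_[xx.ravel(), yy.ravel()]

# One frame per C value; the PNGs can be stitched into a gif afterwards
# (e.g. with ImageMagick's convert).
frames = []
for i, C in enumerate(np.logspace(-2, 3, 6)):
    clf = SVC(kernel="rbf", C=C, probability=True, random_state=0).fit(X, y)
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    panels = [
        (axes[0], clf.predict(grid), "predict"),
        (axes[1], clf.decision_function(grid), "decision_function"),
        (axes[2], clf.predict_proba(grid)[:, 1], "predict_proba"),
    ]
    for ax, Z, title in panels:
        # pcolormesh handles constant Z (the all-one-class low-C frames)
        # without the level errors contouring can hit.
        ax.pcolormesh(xx, yy, Z.reshape(xx.shape), shading="auto")
        ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k", s=15)
        ax.set_title(f"{title}, C={C:.3g}")
    fname = f"frame_{i:02d}.png"
    fig.savefig(fname)
    plt.close(fig)
    frames.append(fname)
```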