Learning igraph for R (part 2) — actually, correlations and distributions
Another day, more learning.
(note: I barely touched igraph today, spent most of the time with correlations and distributions, sorry…)
First, I want to know whether the strengths of my nodes are correlated with their degrees. If so, I can simplify my analysis by using just degrees (or just strengths).
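A minimal sketch of this kind of check, assuming a weighted igraph object called `g`. The graph here is a random toy stand-in generated with `sample_gnp`, not the real dataset, and the variable names are only illustrative.

```r
library(igraph)

set.seed(42)
# Toy stand-in for the real data: a random directed graph with integer link weights
g <- sample_gnp(200, 0.05, directed = TRUE)
E(g)$weight <- sample(1:20, ecount(g), replace = TRUE)

deg <- degree(g, mode = "all")     # number of links touching each node
stg <- strength(g, mode = "all")   # sum of link weights per node (uses E(g)$weight)

# Pearson measures linear association; Spearman is rank-based,
# so it is less sensitive to a few extreme nodes
cor(deg, stg)
cor(deg, stg, method = "spearman")
```

On random weights like these the correlation is high almost by construction, so the interesting part is running the same two calls on real data.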
Nice! The correlation between degree and strength is ~0.95! Almost perfect, so I can simplify my analysis. Now, would it be possible to fit a curve to that relationship? If so, I can estimate the probable strength from the degree and vice versa.
Let’s search a little. I can barely remember a “lm(y ~x)” from somewhere. Google to the rescue…
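For reference, a small sketch of what an `lm(y ~ x)` fit looks like. The data below are made up (a hypothetical linear relation plus noise) just so the example runs on its own; `deg` and `stg` stand in for per-node degree and strength.

```r
set.seed(1)
# Hypothetical toy data standing in for per-node degree and strength
deg <- rpois(300, lambda = 8) + 1
stg <- 3 * deg + rnorm(300, sd = 4)

fit <- lm(stg ~ deg)        # ordinary least squares: stg = a + b * deg
coef(fit)                   # intercept a and slope b
summary(fit)$r.squared      # share of the variance explained by the line

plot(deg, stg)
abline(fit, col = "red")    # overlay the fitted line on the scatter plot

# Estimate the strength for a few new degree values
predict(fit, newdata = data.frame(deg = c(5, 10, 20)))
```

Going the other way (degree from strength) is usually done with a separate `lm(deg ~ stg)` fit, since the least-squares line of x on y is in general not the inverse of the line of y on x.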
Ok, not that difficult, but the result is not as good as I expected:
Problem is, some outliers are causing trouble. I expected to see more residuals above the line :( These are the times I find parametric statistics kinda useless. Let me try to find a Theil-Sen estimator for R and see if it performs better…
Found an “mblm” package that seems to do what I want, but it is extremely slow for my dataset :(
Another package, “zyp”, seems to do a better job, but I’m not in the mood to check it now. Let’s try that sometime later.
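To make the idea concrete, here is a hand-rolled base R sketch of the Theil-Sen estimator (slope = median of the slopes over all pairs of points). It is not the code from “mblm” or “zyp”, the helper name `theil_sen` is made up for this example, and the data are synthetic with a few injected outliers.

```r
# Theil-Sen: slope is the median of all pairwise slopes,
# intercept is the median of (y - slope * x)
theil_sen <- function(x, y) {
  idx <- combn(length(x), 2)                  # all index pairs (i < j)
  dx <- x[idx[2, ]] - x[idx[1, ]]
  dy <- y[idx[2, ]] - y[idx[1, ]]
  keep <- dx != 0                             # skip pairs with identical x
  slope <- median(dy[keep] / dx[keep])
  intercept <- median(y - slope * x)
  c(intercept = intercept, slope = slope)
}

set.seed(2)
x <- runif(200, 1, 50)
y <- 2 * x + rnorm(200, sd = 3)
y[1:10] <- y[1:10] + 300                      # inject a few large outliers

theil_sen(x, y)       # robust estimate
coef(lm(y ~ x))       # least squares, for comparison
```

The number of pairs grows quadratically with the number of points, which is one plausible reason a pairwise estimator feels slow on a large dataset.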
My graph models job movements, as in “people worked on this, then moved to work on that”. The weight on a link is the number of people who went from one job to the other. The in-strength of a job is the number of people who came to it, and the out-strength is the number of people who left it.
This way, I can identify if a job is “neutral” (almost the same number of people come and go from it), “accumulative” (more people in than out) or “distributive” (more people out than in).
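One way to turn this into numbers, sketched under loud assumptions: the index `(out - in) / (out + in)`, its sign convention (negative = more people in than out), the tolerance `tol`, and the toy graph are all choices made for this example, not something stated in the post.

```r
library(igraph)

set.seed(7)
# Toy stand-in for the job graph: random directed graph with integer weights
g <- sample_gnp(100, 0.08, directed = TRUE)
E(g)$weight <- sample(1:30, ecount(g), replace = TRUE)

s_in  <- strength(g, mode = "in")    # people arriving at each job
s_out <- strength(g, mode = "out")   # people leaving each job

# Assumed index in [-1, 1]: negative = more in than out, positive = more out than in.
# Nodes with no links give 0/0 = NaN.
balance <- (s_out - s_in) / (s_out + s_in)

tol <- 0.1   # hypothetical cut-off for calling a job "neutral"
kind <- cut(balance,
            breaks = c(-1, -tol, tol, 1),
            labels = c("accumulative", "neutral", "distributive"),
            include.lowest = TRUE)
table(kind, useNA = "ifany")
```

An index bounded at -1 and 1 like this one piles up at the extremes for jobs with very few movements, because a job with a single link can only score exactly -1 or 1.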
But what is “normal”? I need to see the big picture. Kernel density estimation to the rescue. Let’s search a little bit… Hmmm, “density” function, ok… Let’s try it.
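A quick sketch of the `density` function in action. The `balance` vector below is synthetic (a bump near zero plus some values stuck at -1 and 1) and only stands in for a per-job in/out index on a [-1, 1] scale, which is itself an assumption.

```r
set.seed(3)
# Hypothetical index values: a central bump plus spikes at the two extremes
balance <- c(rnorm(400, mean = -0.1, sd = 0.25), rep(-1, 40), rep(1, 40))
balance <- pmin(pmax(balance, -1), 1)       # clamp to [-1, 1]

# Gaussian kernel by default; bandwidth picked by the default rule (bw.nrd0)
kde <- density(balance, from = -1, to = 1)

plot(kde, main = "Balance of in/out movements", xlab = "balance")
rug(balance)                                # tick marks for the raw values
kde$bw                                      # the bandwidth actually used
```

The bandwidth controls how smooth the curve is, so it is worth trying the `adjust` argument of `density` before reading too much into the number of peaks.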
Pretty good, I would say. Later I can fix the title and the labels. The distribution has three modes (peaks) and it’s slightly skewed to the right. More professions with people coming than going. I think the two modes at -1 and 1 are from jobs with few people, and I’d like to check that. I know R plots are pretty flexible, so back to the Internet to find a nice way to compare KDEs.
Found “purrr” and “tidyverse”, that’s going to help a lot. After a decade of Ruby and heavy exposure to Scheme (Racket) it’s difficult to reason using “for” anymore =P
(after a couple of hours) Gosh! This purrr package makes me happy ^_^
Finally!!! It took more time than I thought, but it is exactly what I wanted.
Now it’s clear that the modes at -1 and 1 come from jobs with fewer movements (as I suspected). But, even considering only the biggest ones, the center of the distribution is offset a little toward the negative side, meaning more people “coming” than “going”.
The final code is below.
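What follows is not the original listing but a minimal sketch of how such a comparison could be put together with purrr. The data are simulated, and the column names, the size bins and the `(out - in) / total` index are all assumptions made for the example.

```r
library(purrr)

set.seed(4)
n <- 600
total <- ceiling(rexp(n, rate = 1 / 30))        # hypothetical movements per job (>= 1)
s_in  <- rbinom(n, size = total, prob = 0.55)   # people coming in
s_out <- total - s_in                           # people going out

jobs <- data.frame(total = total, balance = (s_out - s_in) / total)
# Hypothetical size bins
jobs$size <- cut(jobs$total, breaks = c(0, 5, 30, Inf),
                 labels = c("small", "medium", "large"))

# One KDE per size group, no explicit for loop
groups <- split(jobs$balance, jobs$size)
kdes <- map(groups, function(b) density(b, from = -1, to = 1))

# Draw all curves on a shared set of axes
ymax <- max(map_dbl(kdes, function(k) max(k$y)))
cols <- c("red", "darkgreen", "blue")
plot(NULL, xlim = c(-1, 1), ylim = c(0, ymax),
     xlab = "balance", ylab = "density", main = "KDE by job size")
walk2(kdes, cols, function(k, col) lines(k, col = col))
legend("topright", legend = names(kdes), col = cols, lty = 1)
```

In this simulation the small jobs are the ones that pile up at -1 and 1, simply because with one or two movements there are very few possible values for the index.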
Too much for today, but I learnt A LOT. So much that if I tried to explain the code above (using the “purrr” lib), it would take another post =/
Tomorrow… try to generate a random model and maybe experiment with confidence intervals using the bootstrap.