A little note to myself on “why graph models” for data science.
Machine learning is dominated by statistical models. And that’s for a good reason: they perform quite well. Amazingly well, to be honest.
Graph models don’t get much attention, except maybe in the Neo4j community and for fraud detection. Here are 3 points where they shine over statistical models.
1 — It’s difficult for a statistical model to use signals with a small sample. A single phone number is associated with only a few people. Attributes like that are not (commonly) used in non-graph models for exactly that reason, but they are quite strong in graph models.
2 — Statistical models don’t consider indirect relations. That’s clear in fraud detection: if a confirmed fraudster uses a phone number and someone else then uses the same number, the second person’s other attributes are not treated as a signal until that person is also confirmed as a fraudster. Graph models let us get value from those indirect attributes.
3 — Statistical models don’t distinguish between strong and weak signals. Two users with the same phone number or email are more tightly related than two users who share the same IP, or whose orders arrived in the same burst.
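To make the three points concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `LINK_WEIGHT` values, the function names, the sample records, and the "strongest path" scoring rule are not taken from any particular library or production system. It links users through shared attributes (point 1), lets a confirmed fraudster's signal travel two hops (point 2), and weights a shared phone more heavily than a shared IP (point 3).

```python
from collections import defaultdict

# Hypothetical link strengths, chosen only for illustration.
LINK_WEIGHT = {"phone": 0.9, "email": 0.9, "ip": 0.3}


def build_graph(records):
    """Map each (attribute type, value) pair to the set of users sharing it."""
    users_by_attr = defaultdict(set)
    for user, attr_type, value in records:
        users_by_attr[(attr_type, value)].add(user)
    return users_by_attr


def risk_scores(users_by_attr, confirmed_fraudsters, max_hops=2):
    """Spread risk from confirmed fraudsters along shared attributes.

    Each hop multiplies the score by the weight of the shared attribute,
    and every user keeps the strongest score that reaches them.
    """
    scores = {user: 1.0 for user in confirmed_fraudsters}
    for _ in range(max_hops):
        updated = dict(scores)
        for (attr_type, _), users in users_by_attr.items():
            # Scores from the previous hop only, so each pass adds one hop.
            strongest = max((scores.get(u, 0.0) for u in users), default=0.0)
            passed_on = strongest * LINK_WEIGHT[attr_type]
            for user in users:
                if passed_on > updated.get(user, 0.0):
                    updated[user] = passed_on
        scores = updated
    return scores


records = [
    ("alice", "phone", "555-0100"),
    ("bob", "phone", "555-0100"),    # shares a phone with alice (strong link)
    ("bob", "ip", "203.0.113.7"),
    ("carol", "ip", "203.0.113.7"),  # only linked to alice through bob
    ("alice", "ip", "198.51.100.2"),
    ("dave", "ip", "198.51.100.2"),  # shares an IP with alice (weak link)
]
graph = build_graph(records)
scores = risk_scores(graph, confirmed_fraudsters={"alice"})
```

With alice as the only confirmed fraudster, bob ends up with the highest score among the others because of the shared phone, dave gets a weaker score from the shared IP, and carol still receives a small score through bob even though bob is not confirmed.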
Overall, graph models have strengths where statistical models are weak, and vice versa. Also, graph models are typically processing-intensive because there is no clear “training vs evaluating” line. It’s all about graph updating. One of the biggest tricks is to separate updating from querying, so we have a “graph updating” phase that can run in the background, while the “graph query” phase stays fast enough for production.
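A rough sketch of that split, under stated assumptions: the `GraphIndex` class and its method names are hypothetical, and connected components (via a small union-find) stand in for whatever graph computation is actually needed. The point is only the shape: `update()` does the expensive work and publishes a snapshot, while `query()` is a dictionary lookup.

```python
from collections import defaultdict


class GraphIndex:
    """Keeps the expensive graph update apart from the cheap production query."""

    def __init__(self):
        self._edges = []       # raw (user, attribute) links, appended as they arrive
        self._cluster_of = {}  # precomputed snapshot read by query()

    def add_link(self, user, attribute):
        self._edges.append((user, attribute))

    def update(self):
        """Background phase: recompute connected components over all links."""
        parent = {}

        def find(node):
            parent.setdefault(node, node)
            while parent[node] != node:
                parent[node] = parent[parent[node]]  # path halving
                node = parent[node]
            return node

        for user, attribute in self._edges:
            parent[find(("user", user))] = find(("attr", attribute))

        members = defaultdict(set)
        for node in list(parent):
            if node[0] == "user":
                members[find(node)].add(node[1])

        # Publish the new snapshot with a single assignment.
        self._cluster_of = {
            user: frozenset(group) for group in members.values() for user in group
        }

    def query(self, user):
        """Production phase: a lookup in the snapshot, no graph traversal."""
        return self._cluster_of.get(user, frozenset({user}))


index = GraphIndex()
index.add_link("alice", "phone:555-0100")
index.add_link("bob", "phone:555-0100")
index.add_link("carol", "ip:203.0.113.7")
index.update()
before = index.query("carol")  # carol is alone in her cluster

index.add_link("carol", "phone:555-0100")
stale = index.query("carol")   # unchanged until the next update()
index.update()
after = index.query("carol")   # now clustered with alice and bob
```

The trade-off is staleness: a query only sees links that were present at the last `update()`, which is the price for keeping the query path cheap.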