Similarity Graphs

The idea of reasoning about recommendable content as a similarity graph is quite widespread. Broadly speaking, you start by drawing a set of circles (for users) on the left and a set of circles (for “items” - songs, movies, etc.) on the right; when a user rates, listens to, or otherwise interacts with an item, you draw an edge from the corresponding left circle to the right circle (i.e. a bipartite graph). What collaborative filtering algorithms do is project this two-sided graph into two one-sided representations: one where users are linked to other users, and one where items are linked to other items, based on how similar they are.
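As a minimal sketch of that projection (the toy ratings dictionary below is made up for illustration), the snippet links two items whenever at least one user has interacted with both; real collaborative filtering algorithms would weight these edges with a similarity measure rather than a raw co-occurrence count:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical user -> {item: rating} data; any bipartite interaction data works.
ratings = {
    "alice": {"song_a": 5, "song_b": 3},
    "bob":   {"song_a": 4, "song_c": 2},
    "carol": {"song_b": 5, "song_c": 4},
}

# Project the user-item bipartite graph onto the item side: two items
# are linked if a user has rated both, weighted by the number of co-raters.
item_graph = defaultdict(int)
for user, items in ratings.items():
    for a, b in combinations(sorted(items), 2):
        item_graph[(a, b)] += 1

print(dict(item_graph))
# {('song_a', 'song_b'): 1, ('song_a', 'song_c'): 1, ('song_b', 'song_c'): 1}
```

The user-user projection works the same way: swap the loop to link two users whenever they have rated an item in common.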

This kind of abstraction has been used in a bunch of places: for example, Oscar Celma used graphs to help users navigate toward music in the long tail, and Paul Lamere posted graphs made with the EchoNest API on his blog. I’ve also dabbled in this area a bit, though not with music listening data; I was using the (more traditional) MovieLens and Netflix datasets. The question that comes to mind when reading about techniques that operate on these graphs, though, is: are the underlying graphs faithful representations of similarity between content? What if the graphs are wrong?

A tiny (potentially biased) example shows this: on Last.fm, Led Zeppelin’s #2 similar artist is Pink Floyd, a band that does not appear at all in Paul Lamere’s graph (note: the latter graph has been pruned, so maybe this is a bad example, but you get the idea). Navigating graphs for the same artist built from two different datasets may lead you down very different paths of discovery.

This problem is rooted in the fact that different similarity measures don’t agree with one another: given the same two vectors of ratings, different measures will produce markedly different similarity values (and distributions of similarity). As I looked at in this paper, this sometimes means that even random-valued similarity produces pretty good predictive accuracy results. Another way of thinking about the problem is to look at how the graph changes over time as users rate more content. One of the main results I presented as a poster at RecSys ‘08 is that, in the case of movies being explicitly rated, this graph is subject to huge (and quick) change: two users will be extraordinarily similar now, and drop out of each other’s neighbourhood later. Is this a good thing? How does this reflect on the collaborative filtering assumption that “users who have been like-minded in the past are likely to share similar future interests”?
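To make the disagreement between measures concrete, here’s a small sketch (the rating vectors are made up for illustration) comparing two common measures, cosine similarity and Pearson correlation, on the same pair of users:

```python
import math

a = [5, 4, 1, 1]  # user A's ratings on four co-rated items
b = [2, 1, 4, 4]  # user B's ratings on the same items

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((p - mx) * (q - my) for p, q in zip(x, y))
    den = math.sqrt(sum((p - mx) ** 2 for p in x)) * \
          math.sqrt(sum((q - my) ** 2 for q in y))
    return num / den

print(f"cosine:  {cosine(a, b):.2f}")   # ~0.55: moderately similar
print(f"pearson: {pearson(a, b):.2f}")  # ~-0.89: strongly dissimilar
```

Cosine sees two users whose ratings are all positive as fairly similar; Pearson, which mean-centres each user’s ratings first, sees their tastes as nearly opposite. A neighbourhood graph built with one measure can look very different from one built with the other.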