YouTube Behavior

Apr 2, 2008

I read an interesting paper, “Identifying User Behavior in Online Social Networks,” by Marcelo Maia, Jussara Almeida and Virgílio Almeida. It was presented yesterday at the First International Workshop on Social Networks (co-located with EuroSys 2008).

The paper uses an interesting dataset: a social network built from user subscriptions on YouTube. In other words, if I subscribe to your video uploads, then I link to you in the network. Here is a very brief summary. How can users be classified according to their behavior? An answer to this question would help site designers tailor their sites to the target audience; however, trying to identify groups of similarly behaved users from individual attributes alone does not produce useful results. So what can be done? More informative traits can be used: the social interaction attributes. For example, take the subscription network of YouTube: each user’s in-degree (the number of people who subscribe to that user’s content), out-degree (the number of subscriptions the user holds), and reciprocity (mutual subscriptions), together with the number of uploads, watches, and channel views, allow user behavior to be classified into five groups. The three main groups that appear are the content producers, the consumers, and the mixed producer/consumers. The other two are the old, possibly inactive users and the users with small degree and a high clustering coefficient (the cliques).
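To make the graph attributes concrete, here is a minimal sketch of how in-degree, out-degree, and reciprocity could be computed from a list of subscriptions. The function name, the toy user names, and the exact definition of reciprocity (the share of a user's subscriptions that are returned) are my own assumptions for illustration; the paper may define and compute these differently.

```python
from collections import defaultdict

def interaction_features(subscriptions):
    """subscriptions: iterable of (subscriber, channel_owner) pairs."""
    out_links = defaultdict(set)  # user -> users they subscribe to
    in_links = defaultdict(set)   # user -> users subscribed to them
    for subscriber, owner in subscriptions:
        out_links[subscriber].add(owner)
        in_links[owner].add(subscriber)

    features = {}
    for user in set(out_links) | set(in_links):
        out_set = out_links.get(user, set())
        in_set = in_links.get(user, set())
        mutual = len(out_set & in_set)
        features[user] = {
            "in_degree": len(in_set),
            "out_degree": len(out_set),
            # Assumed definition: fraction of this user's subscriptions
            # that are returned by the other party.
            "reciprocity": mutual / len(out_set) if out_set else 0.0,
        }
    return features

# Hypothetical toy network: ann and bob subscribe to each other,
# cat subscribes to both but has no subscribers.
edges = [("ann", "bob"), ("bob", "ann"), ("cat", "bob"), ("cat", "ann")]
features = interaction_features(edges)
```

In this toy network, cat looks like a pure consumer (out-degree 2, in-degree 0), while ann and bob have both subscribers and subscriptions. The remaining attributes (uploads, watches, channel views) would simply be extra columns per user.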

The users in this dataset were classified using k-means, which relies on a pre-defined value of k to work. Another interesting contribution is a method for choosing k, based on balancing the ratio of inter-cluster to intra-cluster distance (details in the paper). Of course, just like needing to specify k, a more general weakness of these techniques seems to be that you need to know what you are looking for before you can find any structure. In other words, if the authors had decided to cluster on different social-interaction attributes (or social-network graph properties), their results might have been remarkably different.
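As a rough illustration of the idea (not the paper's actual criterion, for which see the paper), the sketch below runs a plain k-means for several values of k and reports the ratio of mean intra-cluster distance to mean inter-centroid distance. Everything here is an assumption made for the example: the helper names, the toy 2-D points, and the deterministic farthest-point initialization, which I use only so the result is reproducible.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def farthest_point_init(points, k):
    # Deterministic seeding: start at the first point, then repeatedly
    # add the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iters=50):
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

def intra_inter_ratio(centers, clusters):
    # Mean point-to-own-center distance over mean center-to-center distance.
    intra = [dist(p, centers[i]) for i, c in enumerate(clusters) for p in c]
    inter = [dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

# Hypothetical toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

ratios = {k: intra_inter_ratio(*kmeans(points, k)) for k in (2, 3, 4)}
centers3, clusters3 = kmeans(points, 3)
```

On this toy data the ratio drops sharply from k=2 to k=3 and does not improve at k=4, which matches the three groups built into the points. On real data a ratio like this tends to keep shrinking as k grows, so one would look for the point where it stops improving noticeably rather than for its overall minimum.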

There are a lot of other interesting papers that use YouTube datasets, including this one that looks at how content popularity on the site fluctuates.