Using Drupal to analyze Twitter data

The Twitter dataset is an extremely valuable source of social networking information. That explains the huge number of Twitter-related sites that keep springing up, each performing one specific analysis of Twitter data. The one I was looking at today is TwitterSheep, which just creates a tagadelic-type cloud of keywords from a user's followers' bios. Pretty neat idea, but it was frustrating that the Web application didn't let me actually see which of my followers are in which category, make a list out of them, email them, etc.

The problem with these many Web sites is that developing such new functionality takes too much effort. Some of them are just the result of one weekend's coding spree, and there's only so much one can code in that time. But with a system like Drupal, we're dealing with a completely different state of affairs: we're not coding Web features anymore, we're coding pure information processing. We therefore have much more time to focus on the data analysis, if that's what we want to do. Let's get back to Twitter.

What we need is a module that:

  • Replicates the Twitter data model. It's not heady stuff: the external entities are User, Tweet and List. CCK will do nicely. The important part is that they are content types, so that nodes are created from them. In Drupal 6 at least, this is the easiest way to fully expose the information to the full capacity of the other modules.
  • Allows associating a Drupal account with one or more Twitter accounts.
  • Periodically reads the Twitter feed for each account and updates its own dataset with it.
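
To make those three requirements concrete, here is a minimal sketch of the bookkeeping involved. It is written in plain Python rather than as a Drupal module, so it only illustrates the logic, not how CCK or cron would actually be wired up. All the names (`TwitterStore`, `link_account`, `sync`, `fetch_tweets`) are made up for the illustration, the List entity is left out for brevity, and the fetcher is passed in as a function so that no particular Twitter API call is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class TwitterUser:
    screen_name: str
    bio: str = ""


@dataclass
class Tweet:
    tweet_id: int
    author: str
    text: str


@dataclass
class TwitterStore:
    """Local copy of the Twitter data (hypothetical, in-memory only)."""
    # Drupal account id -> Twitter screen names linked to that account
    accounts: Dict[int, List[str]] = field(default_factory=dict)
    users: Dict[str, TwitterUser] = field(default_factory=dict)
    tweets: Dict[int, Tweet] = field(default_factory=dict)

    def link_account(self, drupal_uid: int, screen_name: str) -> None:
        """Associate a Drupal account with one more Twitter account."""
        names = self.accounts.setdefault(drupal_uid, [])
        if screen_name not in names:
            names.append(screen_name)
        self.users.setdefault(screen_name, TwitterUser(screen_name))

    def sync(self, fetch_tweets: Callable[[str], Iterable[Tweet]]) -> int:
        """Pull the feed of every linked account; return how many tweets were new.

        fetch_tweets is an assumed callable that returns the recent tweets
        for a screen name. In a real module this would run periodically.
        """
        added = 0
        for names in self.accounts.values():
            for name in names:
                for tweet in fetch_tweets(name):
                    if tweet.tweet_id not in self.tweets:
                        added += 1
                    # Insert new tweets, overwrite ones we already have
                    self.tweets[tweet.tweet_id] = tweet
        return added
```

Calling `sync` twice with the same feed adds nothing the second time, which is the property you want from something that runs on a schedule.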

Given these features and the wealth of modules available to analyze data (starting with Views of course), we could program Drupal to show a tag cloud, drill down to find the users, group them and email them, all in about an hour. Showing threaded conversations, searching with Solr, adding metadata and geo-mapping: 3 more hours. Isn't that great?
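
The tag cloud with drill-down is really just an inverted index over the followers' bios. Here is a rough sketch in plain Python, assuming the bios are already available as a mapping from screen name to bio text. The function names, the stopword list and the "longer than two characters" rule are arbitrary choices for the illustration, not what TwitterSheep or any Drupal module actually does.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Tiny illustrative stopword list; a real one would be much longer
STOPWORDS = {"and", "the", "for", "with", "from", "that", "this"}


def keyword_index(bios: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of followers whose bio contains it."""
    index = defaultdict(set)
    for screen_name, bio in bios.items():
        for word in re.findall(r"[a-z0-9]+", bio.lower()):
            # Skip very short words and stopwords
            if len(word) > 2 and word not in STOPWORDS:
                index[word].add(screen_name)
    return dict(index)


def cloud_weights(index: Dict[str, Set[str]]) -> Dict[str, int]:
    """Weight of each keyword in the cloud: number of followers using it."""
    return {word: len(names) for word, names in index.items()}
```

The weights drive the cloud, and looking up `index["drupal"]` gives the list of followers behind that keyword, which is exactly the drill-down the Web application was missing.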

Now if we could only automate theming...

ThinkTank is another example of such an application, but distributed as open source. It is interesting to see how they refresh their data from the Twitter dataset. The UI, however, is hand-coded, which makes it awkward and fragile.