Building #recotype, part 1: Weaving threads out of tweets

Earlier this year, I talked about the value of monitoring Twitter to extract valuable data. About a month ago, we finally decided to go ahead and build our application based on this concept. The core idea is pretty simple: people use the #reco hashtag to recommend things or to ask for recommendations on Twitter. Our application, #recotype, monitors the Twitter stream for this hashtag and reassembles the conversations around it, like so:

[Screenshot: #recotype]

The first challenge was to display #reco tweets as threaded conversations. The Twitter API does not provide a way to retrieve a complete thread, or to search for the replies to a specific tweet. (This has been a feature request for the past 2 years, but without a satisfactory resolution yet.) A number of sites were created for the express purpose of threading Twitter conversations, but we needed our own solution that would integrate with the rest of the application. Here's how we did it:

Listen to the Twitter stream

A background PHP application uses Phirehose to connect to the Twitter Streaming API and listens for tweets containing "reco" or "recos". Each matching tweet is normalized and saved in the Drupal database as the following records:

  • One record for the tweet
  • One record for the tweet author
  • One record for the original tweet, if the current tweet is actually a retweet
  • One record for the original author
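The listener itself can be sketched along these lines. This is a minimal illustration of the Phirehose usage pattern, not our production code: the `RecoConsumer` class name and the `saveTweet()` helper are placeholders for our own storage logic.

```php
<?php
// Sketch of the stream listener, assuming the Phirehose library is available.
require_once 'phirehose/lib/Phirehose.php';

class RecoConsumer extends Phirehose
{
  // Phirehose calls this once for every status received from the stream.
  public function enqueueStatus($status)
  {
    $tweet = json_decode($status, TRUE);
    if ($tweet === NULL || !isset($tweet['id_str'])) {
      return; // Skip keep-alive newlines and malformed payloads.
    }
    $this->saveTweet($tweet);
  }

  protected function saveTweet(array $tweet)
  {
    // Placeholder: normalize into the tweet/author records listed above
    // and persist them through Drupal's database layer.
  }
}

// METHOD_FILTER gives us the filtered stream; setTrack() supplies the keywords.
$consumer = new RecoConsumer('twitter_user', 'twitter_password', Phirehose::METHOD_FILTER);
$consumer->setTrack(array('reco', 'recos'));
$consumer->consume(); // Blocks forever; Supervisor restarts it if it dies.
```

Phirehose handles reconnection and back-off internally, so the consumer can simply block in `consume()` and let the process supervisor deal with hard failures.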

In the case of a retweet, the original tweet is treated as the parent tweet, just like a reply: the retweet's in_reply_to_status_id is manually linked to the original.
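The parent-resolution rule can be expressed as a small pure function. The field names follow the Twitter API payload; the function itself is an illustrative convention, not a quote from our codebase:

```php
<?php
// Derive the parent tweet ID used for threading. Returns NULL for a
// top-level tweet that starts a new thread.
function recotype_parent_id(array $tweet)
{
  // A native retweet carries the original status; treat the original as
  // the parent, exactly like a reply.
  if (isset($tweet['retweeted_status'])) {
    return $tweet['retweeted_status']['id_str'];
  }
  // A reply already points at its parent.
  if (!empty($tweet['in_reply_to_status_id_str'])) {
    return $tweet['in_reply_to_status_id_str'];
  }
  return NULL;
}
```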

Listen to registered users' timelines and mentions

Because we are looking for complete threads, not just tweets containing #reco, relying on the filtered Twistream above is not enough: we would not be able to find the replies to a #reco tweet this way. Instead of listening to the whole, unfiltered Twistream, which would consume a lot of processing power and storage space, we opted to monitor registered users' timeline and mentions feeds.
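One way to poll those feeds is to keep a since_id cursor per user and endpoint, so each run fetches only tweets newer than the last one seen. All of the `recotype_*` helpers below are hypothetical placeholders for storage and API code, and the endpoint names assume the REST API's user-timeline and mentions resources:

```php
<?php
// Sketch of a per-user polling pass, under the assumptions stated above.
foreach (recotype_registered_users() as $account) {
  foreach (array('statuses/user_timeline', 'statuses/mentions') as $endpoint) {
    $params = array('count' => 200);
    $since_id = recotype_last_seen_id($account, $endpoint);
    if ($since_id) {
      $params['since_id'] = $since_id; // Only tweets newer than the last run.
    }
    $tweets = recotype_api_get($account, $endpoint, $params);
    foreach ($tweets as $tweet) {
      recotype_save_tweet($tweet); // Same normalization as the stream listener.
    }
    if (!empty($tweets)) {
      // Twitter returns newest first, so the cursor is the first item's ID.
      recotype_set_last_seen_id($account, $endpoint, $tweets[0]['id_str']);
    }
  }
}
```

Using since_id keeps the polling cheap and idempotent: re-running the pass after a crash re-fetches at most one window of tweets, which the save step can deduplicate by tweet ID.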

Create conversation threads out of the aggregated data

Now that we have all the data we need, another background process iterates over the unthreaded tweets and processes them based on the in_reply_to_status_id attribute, which is also populated for retweets as noted above. The result is placed in a separate hierarchy table that is optimized for fast processing. Each unprocessed tweet is enqueued in a work queue via Drupal Queue. These jobs are activated via calls to drush queue-cron, invoked by a continuously running Bash script.
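A minimal sketch of this step using the Drupal Queue API might look as follows. The table names, the `threaded` flag, and the queue name are assumptions for illustration; only the queue calls and the Drupal 6 database functions are real API:

```php
<?php
// Enqueue every tweet that has not been threaded yet.
function recotype_enqueue_unthreaded() {
  $queue = drupal_queue_get('recotype_threading');
  $result = db_query("SELECT tweet_id FROM {recotype_tweet} WHERE threaded = 0");
  while ($row = db_fetch_object($result)) {
    $queue->createItem($row->tweet_id);
  }
}

// Worker callback, run per item when `drush queue-cron` processes the queue
// (the callback is registered against the queue name via the module's hooks).
function recotype_thread_tweet($tweet_id) {
  $parent = db_result(db_query(
    "SELECT in_reply_to_status_id FROM {recotype_tweet} WHERE tweet_id = %d",
    $tweet_id));
  // Write the edge into the hierarchy table; NULL parent means thread root.
  db_query("INSERT INTO {recotype_hierarchy} (tweet_id, parent_id) VALUES (%d, %d)",
    $tweet_id, $parent);
  db_query("UPDATE {recotype_tweet} SET threaded = 1 WHERE tweet_id = %d",
    $tweet_id);
}
```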

Robust background processes

To manage the multiple background processes, we created an infrastructure that can cope with failures:

  • We used Supervisor to daemonize the background processes and restart them when they fail or exit.
  • We used Monit to monitor Apache and MySQL.
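For reference, a Supervisor program entry for a listener like ours could look like this; the program name, paths, and log locations are examples, not our actual configuration:

```ini
; Illustrative supervisord entry for the stream listener.
[program:recotype_stream]
command=php /var/www/recotype/stream_listener.php
autostart=true
autorestart=true         ; restart on failure or clean exit
startretries=10
stderr_logfile=/var/log/recotype/stream_listener.err.log
```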

Conclusion

Although #recotype is still in its infancy, building this first iteration was both fun and challenging. I hope I successfully conveyed some of the problems faced and the solutions that we used.
