Earlier this year, I talked about the value of the data that can be extracted by monitoring Twitter. About a month ago, we finally decided to go ahead and build our application based on this concept. The core idea is pretty simple: people use the #reco hashtag to recommend things or ask for recommendations on Twitter. Our application, #recotype, simply monitors the Twitter stream for this hashtag and re-assembles the conversations around it, like so:
The first challenge was to display #reco tweets as threaded conversations. The Twitter API does not provide a way to retrieve a complete thread, or to search for replies to a specific tweet. (This has been a feature request for the past two years, without a satisfactory resolution yet.) A number of sites were created for the express purpose of threading Twitter conversations, but we needed our own solution that would integrate with the rest of the application. Here's how we did it:
A background PHP application uses Phirehose to connect to the Twitter Stream API and listens for tweets containing "reco" or "recos". These tweets are saved in the Drupal database after being normalized into the following records:
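To give a feel for this step, here is a minimal sketch of such a listener. It assumes Phirehose's basic filter interface (subclassing `Phirehose` and implementing `enqueueStatus()`, which receives each status as a raw JSON string); the table and column names, and the Drupal 6-style `db_query()` call, are illustrative rather than our exact schema.

```php
<?php
require_once 'phirehose/lib/Phirehose.php';

class RecoStream extends Phirehose
{
  // Phirehose calls this once per status, passing the raw JSON string.
  public function enqueueStatus($status)
  {
    $tweet = json_decode($status);
    if (!$tweet || empty($tweet->id_str)) {
      return; // skip keep-alives and malformed payloads
    }
    // Normalize into our own records (table/columns are hypothetical).
    db_query(
      "INSERT INTO {reco_tweet} (tid, uid, body, in_reply_to, created)
       VALUES ('%s', '%s', '%s', '%s', %d)",
      $tweet->id_str,
      $tweet->user->id_str,
      $tweet->text,
      $tweet->in_reply_to_status_id_str,
      strtotime($tweet->created_at)
    );
  }
}

$stream = new RecoStream('username', 'password', Phirehose::METHOD_FILTER);
$stream->setTrack(array('reco', 'recos'));
$stream->consume(); // blocks forever, reconnecting as needed
```

`consume()` handles reconnects and backoff on its own, which is one of the main reasons to use Phirehose rather than talking to the streaming endpoint directly.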
In the case of a retweet, the original tweet is considered to be the parent tweet, just as with a reply: the retweet's in_reply_to_status_id is manually linked to the original.
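That linking step can be sketched as a small helper run while saving each tweet. The payload field names (`retweeted_status`, `in_reply_to_status_id_str`) are from the Twitter API; the function itself is illustrative.

```php
<?php
// Decide which tweet, if any, is the parent of the one being saved.
function reco_parent_id($tweet)
{
  if (!empty($tweet->retweeted_status)) {
    // Retweet: treat the original tweet as the parent, like a reply.
    return $tweet->retweeted_status->id_str;
  }
  // Genuine reply, or NULL for a top-level tweet.
  return $tweet->in_reply_to_status_id_str;
}
```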
Because we are looking for complete threads, not just tweets containing #reco, it is not enough to rely on the filtered Twistream above. Specifically, we would not be able to find the replies to a #reco tweet this way. Instead of listening to the whole, unfiltered Twistream, which would consume a lot of processing power and storage space, we opted to monitor registered users' timelines and mentions feeds. Here's how we did it:
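A per-user polling pass along these lines is what this amounts to. The endpoint names (`statuses/user_timeline`, `statuses/mentions`) are from the v1 REST API; the helpers `reco_last_seen_id()`, `reco_api_get()`, and `reco_save_tweet()` are hypothetical stand-ins for the OAuth plumbing and the same normalization used by the streaming path.

```php
<?php
// Poll one registered account's own tweets and mentions for new items.
function reco_poll_user($account)
{
  foreach (array('statuses/user_timeline', 'statuses/mentions') as $endpoint) {
    // Only ask for tweets newer than the last one stored for this feed,
    // so repeated polls stay cheap and within rate limits.
    $since_id = reco_last_seen_id($account, $endpoint); // hypothetical helper
    $tweets = reco_api_get($endpoint, array(            // hypothetical OAuth wrapper
      'since_id' => $since_id,
      'count'    => 200,
    ), $account);
    foreach ($tweets as $tweet) {
      reco_save_tweet($tweet); // same normalization as the streaming path
    }
  }
}
```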
Now that we have all the data we need, another background process iterates over the unthreaded tweets and processes them based on the in_reply_to_status_id attribute, which, as noted above, is also populated for retweets. The result is placed in a separate hierarchy table that is optimized for fast processing. Each unprocessed tweet is enqueued in a work queue via Drupal Queue. These jobs are executed via calls to drush queue-cron, invoked by a continuously running Bash script.
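The worker behind drush queue-cron might look roughly like this. `hook_cron_queue_info()` is the Drupal Queue way of declaring a worker callback; the queue name, table layout, and the `reco_load_tweet()` / `reco_thread_root()` helpers are illustrative assumptions, not our exact code.

```php
<?php
// Declare the queue worker that drush queue-cron will run.
function reco_cron_queue_info()
{
  return array(
    'reco_thread' => array(
      'worker callback' => 'reco_thread_tweet',
      'time' => 30, // max seconds of work per queue-cron invocation
    ),
  );
}

// Thread one tweet: resolve its parent and record it in the hierarchy table.
function reco_thread_tweet($tid)
{
  $tweet  = reco_load_tweet($tid);   // hypothetical loader
  $parent = $tweet->in_reply_to;     // already set for retweets, as noted above
  // Also store the thread root, so displaying a conversation is a
  // single indexed query instead of a walk up the reply chain.
  $root = $parent ? reco_thread_root($parent) : $tid;
  db_query(
    "INSERT INTO {reco_hierarchy} (tid, parent, root) VALUES ('%s', '%s', '%s')",
    $tid, $parent, $root
  );
}
```

The continuously running Bash script is then as simple as a loop along the lines of `while true; do drush queue-cron; sleep 5; done`.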
To manage the multiple background processes, we created an infrastructure that can cope with failures:
Although #recotype is still in its infancy, building this first iteration was both fun and challenging. I hope I have successfully conveyed some of the problems we faced and the solutions we used.